Employees are feeding sensitive data to ChatGPT, raising security fears
by taubek
We block ChatGPT, as do most federal contractors. I think it’s a horrible exploit waiting to happen:
- there’s no way they’re manually scrubbing out sensitive data, so it’s bound to spill out from the training data when prompting the model
- OpenAI is openly storing all this data they’re collecting, to the extent that they’ve had several leaks now where people can see others’ conversations and data. We are one step away, if it hasn’t already happened, from an exploit of their systems (which were likely built with scale and performance, not security, as the top priority) that could leak a monumental amount of user data.
In the most innocent case they could leak the personal info of naive users. But if LinkedIn is any indication, the business world is filled with dopes who genuinely believe the AI is free-thinking and better than their employees. For every org that restricts ChatGPT use, there are fifty others that don’t, most of which have at least one of said dopes ready to upload confidential data at a moment’s notice.
Wouldn’t even put it past military personnel putting S/TS information into it at this point. OpenAI should include more brazen warnings against providing this type of data if they want to keep up this facade of “we can’t release it because ethics” because cybersecurity is a much more real liability than a supervised LM turning into terminator.
This really depends on the cost/benefit tradeoff for the entity in question. If using ChatGPT makes you X% more productive (shipping faster / lowers labor costs / etc), but comes with Y% risk of data leakage, is that worth it in expectation or not? I would argue that there definitely exist companies for which it's worth the tradeoff.
By the way, OpenAI says they won’t use data submitted through its API for model training - https://techcrunch.com/2023/03/01/addressing-criticism-opena...
To anyone who may be pasting code along the lines of 'convert this SQL table schema into a [pydantic model|JSON Schema]', where you're pasting in the actual text: just ask it instead to write you a [python|go|bash|...] function that reads in a text file and 'converts an SQL table schema to output x' or whatever. Related/not-related: a great pandas docs replacement is another great and safe use-case.
Point is, for a meaningful subset of high-value use-cases you don't need to move your important private stuff across any trust boundaries, and it still can be pretty helpful...so just calling that out in case that's useful to anyone...
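For example, here's a rough sketch of the kind of local converter you could ask it to write (hypothetical and deliberately naive, not anything OpenAI provides); you run it on your own machine, so the real schema never crosses the trust boundary:

import re
import sys

# Very naive SQL-type to Python-type mapping; anything unknown falls back to str.
SQL_TO_PY = {"INT": "int", "BIGINT": "int", "VARCHAR": "str", "TEXT": "str",
             "BOOLEAN": "bool", "FLOAT": "float"}

def sql_schema_to_pydantic(path: str) -> str:
    """Read a CREATE TABLE statement from a file and emit pydantic model source."""
    ddl = open(path).read()
    table = re.search(r"CREATE TABLE (\w+)", ddl, re.I).group(1)
    body = ddl.split("(", 1)[1]  # everything after the opening parenthesis
    fields = [f"    {name}: {SQL_TO_PY.get(sql_type.upper(), 'str')}"
              for name, sql_type in re.findall(r"^\s*(\w+)\s+(\w+)", body, re.M)]
    return ("from pydantic import BaseModel\n\n"
            f"class {table.title()}(BaseModel):\n" + "\n".join(fields))

if __name__ == "__main__":
    print(sql_schema_to_pydantic(sys.argv[1]))

It will happily misread constraint lines and exotic types, but that's the point: the generic, shareable part is what you ask the model for, and the sensitive part stays local.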
At first I was impressed by how easy it was to reach a data model with ChatGPT; then I laughed as I tried to tweak it and use it. I realized it didn't really have any concept of a model and was just drawing on its various knowledge bases.
I am unsure whether the so-called AI can think in models; so far it can't, but it's still an impressive assisting tool if you take care of its limitations.
Another area where it falls short is logic. My daughter has a lot of fun with the book "What Is the Name of This Book?", but she was struggling with the "map of Baal" puzzle: to her the answer was a certain map, yet the book had another answer, and I had a third one based on how I interpreted one of the propositions. I never got an answer from ChatGPT without a contradiction in its reasoning, and it turned out the book had been mistranslated into French, so one of its propositions was changed (C: both A and B were knaves) but not the answer.
> At first I was impressed by how easy it was to reach a data model with ChatGPT; then I laughed as I tried to tweak it and use it. I realized it didn't really have any concept of a model and was just drawing on its various knowledge bases.
> I am unsure whether the so-called AI can think in models; so far it can't, but it's still an impressive assisting tool if you take care of its limitations.
I don't know. I'm using it for exactly that ("here's a problem, come up with a data model") and it gives a great starting point.[0]
Not perfect, but after that it's easy to tweak it the old-fashioned way.
I find its data modelling capabilities (in the domain I'm using it for - API services) to be roughly on par with a mid-level developer (for a handwavy definition of "mid-level").
Did you prime it before asking, so it was answering in the appropriate context?
I've been doing that since day one. I can't believe people are pasting real data into these corporate black boxes.
What about Google Docs, Office 365, Github, AWS, Azure, Google Cloud, JIRA, Zendesk, etc?
What is different about ChatGPT (if anything)?
We have data standards and agreements with those companies, we pay them to have expectations. Even then, we're strict about what touches vendor servers and it's audited and monitored. Accounts are managed by us and tied into onboarding and offboarding. If they have a security incident, they notify, there's response and remediation.
ChatGPT seems to be used more like a fast stackoverflow, except people aren't thinking of it like a forum where others will see their question so they aren't as cautious. We're just waiting for some company's data to show up remixed into an answer for someone else and then plastered all over the internet for the infosec lulz of the week.
> We have data standards and agreements with those companies, we pay them to have expectations. Even then, we're strict about what touches vendor servers and it's audited and monitored. Accounts are managed by us and tied into onboarding and offboarding.
For every company like yours there are hundreds that don't. People use free Gmail addresses for sensitive company stuff, paste random things into random pastebins, put their private keys in public repos, etc.
Yes, data leaks from OpenAI are bound to happen (again), and they should beef up their security practices.
But thinking people are using only ChatGPT in an insecure way vastly overestimates their security practices elsewhere.
The solution is education, not avoiding new tools.
Doesn't OpenAI explicitly say that your Q&A on the free ChatGPT are stored and sent to human reviewers to be put in their RL database? Now of course we can't be sure what Google, AWS, etc. do with the data on their disks, but it would be a pretty big scandal if some whistleblower eventually came out and said that Google employees sit and laugh at private bucket contents on GCP or private Google Docs. So there's a difference in stated intention, at least.
Who in their right mind is using free ChatGPT through that shitty no-good web interface of theirs, that can barely handle two queries-and-replies before grinding to a halt? Surely everyone is using the pay-as-you-go API keys and any one of the alternative frontends or integrations?
And, IIRC, pay-as-you-go API requests are explicitly not used for training data. I'm sad GPT-4 isn't there yet - except for those who won the waitlist lottery.
It's really funny to see these types of comments. I would assume a vast majority of users are using the Web interface, particularly in a corporate context where an account for the API could take ages or not be accepted.
If people were smart and performed according to best practices, articles like this one would not be necessary.
I mean, if you're using a free web interface in corporate context, you may just as well use a paid API with your personal account - either way, you're using it of your own volition, and not as approved by your employer. And getting API keys to ChatGPT equivalent (i.e. GPT-3.5) takes... a minute, maybe less.
I am honestly confused how people can use this thing with the interface OpenAI runs. The app has been near-unusable for me, for months, on every device I tried it on.
> and any one of the alternative frontends or integrations?
And what sort of understanding do you have with the alternative frontends/integrations about how they handle your API keys and data? This might be a better solution for a variety of reasons but it doesn't automatically mean your data is being handled any better or worse than by openai.com
I wonder what the distribution of tokens / sec at OpenAI is between the free ChatGPT, paid ChatGPT, and APIs. I’d have to think the free interface is getting slammed. Quite the scaling project, and still nowhere near peaking.
To quote a children's TV show: "Which ones of these things are not like the other ones?"
Some of those are document tools working on language / knowledge. Others are infrastructure, working on ... whatever your infra does, and your infra manages your data (knowledge).
If you read their data policies, you'll find they are not the same.
I wouldn't put sensitive work data/employer IP in a personal Google Doc (et al.) either, no?
Don't use any of it.
[flagged]
To your average user who interfaces with these figurative black boxes with a black box in their hand, how is this particular black box any different than the other black boxes that this user hands their data to every second of every day?
there are plenty of disallowed 'black boxes' within the federal sphere; chatgpt is just yet another.
to take a stab at your question, though : my cell phone doesn't learn to get better by absorbing my telecommunications; it's just used as a means to spy on my personal life by The Powers That Be. The primary purpose of my cell phone is for the conveyance of telecommunications.
ChatGPT hoards data for training and self-improvement in its current state. Its whole modus operandi involves the capture of data, rather than it being used for that tangentially. It could not meaningfully exist without training on something, and at this stage of the game the trend is to self-train with user data.
Until that trend changes people should probably be a bit more suspect about what kind of stuff gets thrown into the training bin.
Those typically have MSAs with legalese where parties stipulate what they will and will not do and often whether or not it’s zero knowledge and often option to have your own instance encryption keys.
If people are using the free version of chatGPT then it’s unlikely there is a contract between the companies and more likely just a terms of use applied by chatGPT and ignored by the users.
No idea
I simply don't give a crap if my employer loses data. I don't care if my carelessness costs my employer a billion bucks down the line as I won't be working for them next year.
Writing that is a really good way to end up on the wrong side of a civil suit.
I have an addon where every other sentence is generated by ChatGPT. Good luck holding me liable for a robot's actions.
"I do not take any kind of responsibility about what I'm doing, or not doing, or thinking about doing or not doing, or thinking about whenever I should be doing or not doing, or thinking about whenever I should be thinking about doing or not doing".
Unless you can prove a given sentence was generated by ChatGPT, it will be assumed it wasn't.
As a morally questionable answering robot, however, I must ask: why should everything else be tainted by the machinery, but evidence like text should not?
Why don’t you feel any responsibility?
I am treating my employment like a corporation would. Risks I do not pay for and do not benefit from mitigating are waste that could allow me to transfer time back to my own priorities, increasing my personal "profit."
Not who you replied to, but if you agree, even a little, with the phrase, "the social contract between employees & employers is broken in the US"... well it goes both ways.
Do you really think the people asking ChatGPT to write their code can make that abstraction?
The fact that they can't do this is the whole reason they have to use ChatGPT.
I use it because it's 10-100x more interesting, fun, and fast as a way to program, instead of me having to personally hand-craft hundreds of lines of boilerplate API interaction code every time I want to get something done.
Besides, it's not like it puts out great code (or even always working code), so I still have to read everything and debug it. And sometimes it writes code that is fine and fit for purpose but horrendously ugly, so I still have to scrap everything and do it myself.
(And then sometimes I spend 10x as long doing that, because it turns out it's also just plain good fun to grow an aesthetic corner of the code just for the hell of it, too — as long as I don't have to.)
And even after all that extra time is factored back in: it's still way faster and more fun than the before-times. I'm actually enjoying building things again.
Pair-programming with ChatGPT is like having an idiot-savant friend who always surprises you. Doesn't matter if the code is horrible, amazing, or something in between. It's always interesting.
And I agree it’s fun. Maybe it’s the simulated social interaction without consequences. I can be completely honest with my robot friend about the shitty or awesome code and no one’s feelings are going to get hurt. ChatGPT will just keep trying to be helpful.
People aren’t using ChatGPT because they can’t do it themselves, they’re using it to save time.
You can be an experienced developer with years of building complex applications behind you and still find ChatGPT useful. I've found it useful for documenting individual methods, simply explaining my own or others' code, writing unit test methods, or just adding boilerplate stuff that saves me an hour I can use elsewhere.
I think many people find ChatGPT useful specifically because they have years of experience building complex applications.
If you know exactly what you want to ask of it, and have the ability to evaluate and verify what it produces, it's incredible what you can get out of it. Sure it's nothing I couldn't have done otherwise... eventually. The productivity it enables is worth every cent.
Easily the best $20 I've spent in ages, they should have run with the initial idea of charging $42.
But holy moly anyone putting confidential information into it needs to stop
I’ve been doing this kind of thing pretty regularly for the past few weeks, even though I know how to do any of the tasks in question. It’s usually still faster, even when taking the time to anonymize the details; and I don’t paste anything I wouldn’t put on a public gist (lots of “foo, bar”, etc)
Precisely because I can abstract it is why I use ChatGPT. It can do the boring, tedious, repetitive stuff instead of me and has shown me the joy of using programming to solve ACTUAL problems yet again, instead of having to spend hours on unimportant problems like "how do I do X with library Y".
But that's the API, not the Chat input or Playground.
Companies can use Azure OpenAI Services to get around this -- there's data privacy, encryption, SLAs even. The problem is it's very hard to get access to (right now).
The #1 problem with corporations saying things is that many things they say are not regulated or are taken on good faith. What happens when OpenAI is acquired and the rules change? These comments are often entirely worthless.
These are contractual terms.
> If using ChatGPT makes you X% more productive (shipping faster / lowers labor costs / etc), but comes with Y% risk of data leakage
X and Y are not alike, and should not be compared. X is a benefit to you(r employer), whereas Y is a risk to the customer who has entrusted you with their data.
You've certainly not worked with _real_ sensitive data. The kind that can bankrupt your business.
I do and if it could be leaked through ChatGPT I would have it blocked.
Risk isn't a single dimension; it's a combination of exposure (the chance of it happening) and impact (how much you will lose).
Mate. You aren’t special. It’s the nature of the profession that most of us are in, that we end up dealing with the “sensitive” data that you’re describing, barring most people working in Big Companies with proper internal controls.
Nothing you’ve said negates anything OP said. It’s simply an elaboration wrapped in elitism.
Risk of leakage? It is not a risk, it is a matter of time.
Let's also not discount that for every "dope" there is at least one "bad actor" who is willing to take the risk to get an edge in their workplace or appease their manager's demands. The warnings will only deter the first group.
> Wouldn’t even put it past military personnel putting S/TS information into it at this point.
Hey, they need someone to proofread their War Thunder forum posts to make sure they're using correct spelling and grammar when leaking classified info. ;-)
(Ref if you don't get the joke: https://taskandpurpose.com/news/war-thunder-forum-military-t...)
Not only that, but the European theater nuclear forces leaking security arrangements and even door PIN codes for nuclear weapons bunkers via online flashcard sites might be a better example, as those leaks were more inadvertent.
I am curious, do you block MS Edge? It has a grammar check for all input boxes that sends data to MS servers to check. Similar to what Grammarly does.
MS also "helpfully" asks you if you want to use that enhanced grammar check in MS Word (as far as I have seen; it might be there in other Office products too). I cannot imagine sending all my documents to MS. But I am not sure most users realize what is happening.
All these companies offer helpful services but are hoovering up data, and no one knows the consequences yet. It feels like ChatGPT is just one symptom of a bigger problem.
You can disable the feature entirely in group policies, I imagine organizations with a decent IT org will do so before deploying the update.
We don't, but the grammar check is disabled. In general, any cloud-based services are vetted before being allowed.
I think MS Edge is getting even worse about this, with the big fucking Bing icon in the corner and making it impossibly hard to get rid of it.
Some military folks put nuclear weapons storage training materials onto Quizlet, so I don’t doubt for a second people would try to put ChatGPT onto a classified computer system.
Possibly I don't know how this all works, but I think if the host of a ChatGPT interface were willing to provide their own API key (and pay), they could then provide a "service" to others (and collect all input).
In that case, you wouldn't know to block them until it was too late.
Ultimately either you must watch/block all outgoing traffic, or you must train your people so thoroughly that they become suspicious of everything. Sadly, being paranoid is probably the most economical attitude these days if IP and company secrets have any value.
> Possibly I don’t know how this all works, but I think if the host of a ChatGPT interface were willing to provide their own API key (and pay), they could then provide a “service” to others (and collect all input).
Well, GP was referring to blocking ChatGPT as a federal contractor. I suspect that as a federal contractor, they are also vetting other people that they share data with, not just blocking ChatGPT as a one-off thing. I mean, generic federal data isn’t as tightly regulated as, say, HIPAA PHI (having spent quite a lot of time working for a place that handles both), but there are externally-imposed rules and consequences, unlike simple internal-proprietary data.
But it really seems like a cat and mouse game. For example, a very determined bad actor could infiltrate some lesser approved government contractor and provide an additional interface/API which would invite such information leaking, and possibly nobody would notice for a long time.
And then they could face death penalty for espionage if they leaked sensitive enough data. You would have to be really stupid to build such a service for government contractors unless you actually are a foreign spy.
At least then we would finally find out if it is constitutional to execute someone for espionage.
If someone is determined to break the rules then yes they break the rules. Network blocking is really just a thing to stop casual mistakes.
> We block ChatGPT, as do most federal contractors. I think it’s a horrible exploit waiting to happen:
Do you also block pastebin? Anything else that has a web form? How is ChatGPT special compared to any other service on the Internet where people can paste data in a form?
I mean... I see the problem, but I think one needs to realize that it's a far more generic problem that has basically nothing to do with ChatGPT and AI. If people paste confidential data into random webpages that's of course bad. But if you block ChatGPT because you fear that, it means you expect that people might do that. And then your problem is not ChatGPT, but lack of awareness what is confidential data and what to do with it.
> Do you also block pastebin? Anything else that has a web form?
pastebin and indeed most things that have some sort of public web form are blocked in all the companies I have worked with.
It is probably a losing battle though, as it is very hard to block everything without default deny.
Paradoxically, maybe GPT could be used to veto websites on first access :)
> pastebin and indeed most things that have some sort of public web form are blocked in all the companies I have worked with.
Search engines too? And these days, that means web browsers, because the (IMHO stupid) idea of combining address and search bars into one means everything you type while trying to open a website gets leaked to some party (most likely Google).
Search engines (and url bars) are indeed not blocked, but I do worry every time I use them. Internal url leaks to google must be extremely common.
I imagine they must be. I'm habitually careful to either click on a link, paste the entirety of the internal URL at once, or enter only the most generic word or words that will surface the URL I want as a history suggestion - all to minimize the chances of leaking anything this way.
You can turn the auto search off
There's a lot of things that you can turn off, but nobody actually does - which is the very reason they ship turned on in the first place.
Because "awareness only" has such a great track record when it comes to security-adjacent issues, and totally satisfies auditors/customers/regulators/...?
I don't think awareness only has any reasonable track record and I would always prefer a technical control if there is one. But I have a hard time seeing any alternative here.
I don't think you can give people access to the web and at the same time prevent them from putting things into forms. That's simply not how it works. And if you're blocking access to a few services where they might do that, well, they have a million others, and you're deceiving yourself that you've done something.
By using Azure, you can access ChatGPT and GPT, which come with enterprise-grade security and established data agreements, setting them apart from OpenAI. I'm not entirely sure of the technical details, but you can explore this option.
Problem is you will have to heavily advertise it in your org because people will not understand why, where etc.
They will go to OpenAI directly and do stuff because "they want it now" and they don't understand why not.
Microsoft is already building it into Office 365 with the same enterprise-grade agreements so it might get easier that way.
Does US intelligence have access to OpenAI data? Private organizations are one thing. But with all the dopes in government positions around the world, OpenAI logs would probably be a treasure trove for intelligence gathering.
They are just one national security letter away from all US-held data.
Microsoft is well known for piping data to US intelligence as a service. It's almost certainly why they bought Skype, then removed all the end to end encryption.
The USA has the Patriot Act and the CLOUD Act to request data from any US company, like AWS, Microsoft 365, Google…
> they’ve had several leaks now where people can see others’ conversations and data
Do you have a source for this? I know some people have claimed to see others' data, but I haven't seen any evidence that that's what's actually being seen, vs LLM hallucinations. OpenAI claims, and I can't imagine they're lying, that the training data is fixed and ends in 2021, so I don't see how it would be possible for user prompts to be leaking into output, absent a massive and very unlikely bug (compared to the much more likely AI hallucination explanation).
Yep: https://openai.com/blog/march-20-chatgpt-outage
Some kind of concurrency bug in a library they were using to retrieve cached data from Redis led to this leak.
> We took ChatGPT offline earlier this week due to a bug in an open-source library which allowed some users to see titles from another active user’s chat history. It’s also possible that the first message of a newly-created conversation was visible in someone else’s chat history if both users were active around the same time.
I think they’re referencing this: https://openai.com/blog/march-20-chatgpt-outage
Comment was deleted :(
> there’s no way they’re manually scrubbing out sensitive data
I was under the impression OpenAI weren't using questions as training data for future models. I recall Sam Altman saying they delete questions after 1 month, but I can't locate the source for that.
Even if they do today, they could change their mind tomorrow, or even tonight.
Even without training the model, it will still end up in logs. For example you can now see your previous questions in the UI and it'll probably be stored in other places of their backend too.
This is still a serious data loss risk.
Right. They also haven't been caught lying about everything else, from cutoff dates to live functionality.
Does blocking ever work? People are smart and usually just work around them.
It works in the sense that it does add an extra "reminder" and requires specific intent. I mean, in this scenario all the people already have been informed that they're absolutely not allowed to do things like that, but if someone has forgotten that, or simply is careless and just wants to "try something out" then if it's unblocked they might actually do it, but if they need to work around a restriction, that forces them to acknowledge that there is a restriction and they shouldn't try to work around it even if they can.
The smart ones don’t paste in all their private data.
And yes, if bypassing the block is combined with disciplinary action, it does work. It’s not worth getting fired over. This is likely what heavily regulated industries like financial services and defense are doing.
Blocks are effective reminders of policies.
I remember someone trying to look up winning lottery numbers at work. The site came up "Blocked: Gambling". It was a little reminder that they're watching our web browsing at work..
Those are pre-configured firewall rules. These firewalls can do deep packet inspection and block traffic.
It's a fairly standard practice. I wouldn't associate it with overreaching surveillance.
Well, a firewall rule based on a cloud-populated access control list interrupted traffic. More likely vendor-related than employer-related.
[dead]
If your competitor uses ChatGPT to compete with you and they're 10x more productive than you, are you still willing to insist? If the productivity gain is 100x, will you?
It might be just as likely that ChatGPT will cause a mistake like Knight Capital because no one bothered to thoroughly verify the AI's looks-good-but-deeply-flawed answer, and the two aren't mutually exclusive possibilities.
Right. I've had ChatGPT completely fail at something as simple as writing a batch file to find and replace text in a text file.
Sure, but humans do that all the time as well
Humans are a lot better at "I don't know how to do this; hey Alice, can you look this over if you've got a sec and tell me if I'm making a noob mistake"
Perhaps the actual phenomenon is that humans are much better at saying "Alice wrote this code, she's pretty good at scripting but she might have made a noob mistake, better check it", or even "I wrote this code.." than they are at saying "ChatGPT wrote this code, but that application is not guaranteed to have correctly identified my problem, but may have just returned something that seems right both to the statistical model and to me, but which is actually deeply flawed, better check it".
The Knight meltdown was more of a dysfunction of change management and trading system operations than it was of using a decommissioned feature flag.
Source: worked there after the meltdown.
This isn't an argument of ChatGPT vs nothing. This is an argument of "external" ChatGPT vs some other AI sitting on your own secured hardware, maybe even a branch of ChatGPT.
> some other AI sitting on your own secured hardware, maybe even a branch of ChatGPT.
Where can I, a random employee, get that? I know how to get ChatGPT.
You can't. So maybe you as a random employee should just do without whatever IT hasn't approved whether you agree or not.
Right, thus meaning your employer gets outcompeted by a company willing to take the risk of handing their data to OpenAI.
Uh, there's no sign of that yet.
[flagged]
> Think about how long it's taken tools like pandas to reach the point that it is now. That entire package can be built to the level it is now in a couple of days.
I am having trouble parsing this statement. You're saying a person equipped with chatGPT trained on data prior to December 2007 (the month before the initial pandas release) could have put together the entire pandas library in a couple of days?
That seems obviously wrong, starting with the fact that one would need to know "what" to build in the first place. If you're saying that chatGPT in 2023 can spit out pandas library source code when asked for it directly, that's obvious.
Somewhere between the impossible statement and the obvious statement I made above, there must be something interesting that you were trying to claim. What was it?
They couldn't even do it today with pandas being in the training set. People are being crazy about this tech.
> Think about how long it's taken tools like pandas to reach the point that it is now. That entire package can be built to the level it is now in a couple of days.
I don't think that is true at all. Do you have an example of a significant project being duplicated in days, or even months, with ANY of these tools?
By significant, I mean something on the order of pandas which you claimed.
And this is completely ignoring the fact that the real hard problem is the design. Spitting out boilerplate code is not. How pandas could be designed perfectly in one afternoon (and generated with GPT) is beyond my comprehension.
I guess they are thinking that ChatGPT would also handle that part ...
Prompt 1: What would be an amazing tech project that would make me rich?
P2: Produce an excellent design for that project. Should be elegant and use microservices and scale to billions of users.
P3: Write all the code for this design.
P4: Tell me how to test and deploy all that code.
P5: How to sell all this for billions?
Cool, please provide a link to a library of similar size and complexity to pandas which was written using ChatGPT in the span of a few days. We'll be waiting.
> Think about how long it's taken tools like pandas to reach the point that it is now. That entire package can be built to the level it is now in a couple of days.
Let me hand you a mirror: You're absolutely and completely wrong
> There is an immense amount of evidence of that
Then it should be easy to provide some?
Of course not!
Security and privacy should be table stakes. Speaking for my country, we need privacy laws with teeth to punish bad actors for shitting people's private information wherever they want in the name of a dollar.
Man the fanboyism is out of control here.
Welcome to Sam Altman News. You must be new here.
So you block internet access for all employees? Cos anything you think is being pasted into ChatGPT is being pasted everywhere, whether it's Google, Slack, Chrome plugins, or public Wifi.
Yes, these things are sometimes blocked in higher security workplaces… up to and including the public internet. Honestly airgapped systems are not all that uncommon anywhere that human life is at risk.
Or all the ChatGPT clones that have sprung up and will continue to spring up every other day.
It's a stupid and patronizing position, but corporate IT are sadly incentivised to be stupid and patronizing.
We saw these same fears with the release of Gmail. Why would you trust your email to Google?!! Aren't they going to train their spam filters on all your data? Aren't they going to sell it, or use it to sell you ads?
Corporations constantly put their most sensitive data in 3rd party tools. The executive in the article was probably copying his company strategy from Google docs.
Yes, there are good reasons for concern, but the power of the tool is simply too great to ignore.
Banning these tools will go the same way as prohibition did in the US, people will simply ignore it until it becomes too absurd to maintain and too profitable to not participate in.
Companies which are able to operate without these fears will move faster, grow more quickly, and ultimately challenge companies restricted to operate without.
Now I think the article should be a wake-up call for OpenAI. Messaging around what is and what is not used for training could be improved. Corporate accounts for Chat with clearer privacy policies would be great and warnings that, yes, LLMs do memorize data and you should treat anything you put into a free product on the web as fair game for someone's training algorithm.
I think this is different in that ChatGPT is expressly using your data as training in a probabilistic model. This means:
* Their contractors can (and do!) see your chat data to tune the model
* If the model is trained on your confidential data, it may start returning this data to other users (as we've seen with Github Copilot regurgitating licensed software)
* The site even _tells you_ not to put confidential data in for these reasons.
Until OpenAI makes a version that you can stick on a server in your own datacenter, I wouldn't trust it with anything confidential.
> I think this is different in that ChatGPT is expressly using your data as training in a probabilistic model.
Google tries hard to sell you on their auto-answers for emails ('smart reply'), wonder how those got trained...
Google had all the same problems, until it found a balance of functionality, security, and privacy.
OpenAI just hasn't started to try adding privacy and security yet.
A language model inherently has a privacy problem. How would you guarantee no leaks?
You simply don't train on the user inputs. There are enough unread books, public repos, and news articles.
Not that I don't expect them to do this, but how is it expressly said to be so?
https://help.openai.com/en/articles/5722486-how-your-data-is...
> OpenAI does not use data submitted by customers via our API to train OpenAI models or improve OpenAI’s service offering. In order to support the continuous improvement of our models, you can fill out this form to opt-in to share your data with us. Sharing your data with us not only helps our models become more accurate and better at solving your specific problem, it also helps improve their general capabilities and safety.
Did you read the next paragraph?
> When you use our non-API consumer services ChatGPT or DALL-E, we may use the data you provide us to improve our models.
I definitely did not correctly read that. Thanks for the clarification. Totally misread the 'our API' bit!
It's also in the FAQ: https://help.openai.com/en/articles/6783457-chatgpt-general-...
> Will you use my conversations for training?
> Yes. Your conversations may be reviewed by our AI trainers to improve our systems.
As the saying goes, if the product is free, then you are the product.
The product is $20!
The product is pay-as-you-go, just sign up for the API keys and use the playground, or an alternative client, instead of their ChatGPT webapp.
chatgpt is (still) free.
Hehe, the old ToS trick. Here it doesn't say "will never use" but "does not use", and I wager that somewhere below it says they can change the ToS at any time in the future, unilaterally.
Sticking it in your own datacenter doesn't really prevent any of these problems (except maybe #2), only now your leaks are internal and because of all the false sense of security, you might wind up leaking far more confidential and specific information (ie. an executive leaking to the rest of the team in advance that they are planning layoffs for noted reasons, whereas that executive might have used more vague terms when speaking to public chatGPT).
Sticking it in your own private datacenter would imply that you can opt in or out of using your data to train the next generation. ChatGPT does not dynamically train itself in realtime.
The implication is that the reason you would bother with ChatGPT at all is to train it on the relevant local data, which is the key value of ChatGPT beyond general public use.
It prevents all of those problems as it puts all the data / data movement under your control.
How so?
Because it's your own data center, which means you own the data and can set up firewall rules to prevent any software running there from leaking data to outside the data center.
Well, you can stick it on Azure.
Trusting Gmail with corporate communication was a terrible idea (and explicitly illegal in a lot of industries), and companies didn't start to adopt it until Google released an enterprise version with table-stakes security features like no training on the data, no ad targeting, auditing, compliance holds, and more.
There's a huge difference between trusting a third party service with strict security and data privacy agreements in place vs one that can (legally) do whatever they want with your corporate data.
This is vital for professional adoption. We cannot live in a world where basically all commercial information, all secrets are being submitted to one company.
Was?
Well it was, until Google Workspace (G Suite) came along and provided essentially an enterprise version of Gmail.
I still question the wisdom of giving data to the worlds largest spyware company that makes its money by converting mass surveillance into dollars.
Hosting your own servers for email and business files is infinitely more costly from a performance, uptime, and personnel standpoint, and self-hosted office with network shares is not suitable for most businesses' needs of multi-user collaboration (sure, you can use Office / M365 desktop apps which do collaboration, but then you're forced to use the desktop apps).
Google Workspace solves the issues of data privacy both by having extreme user data & datacenter access controls[0,1], a robust terms document that details how data is collected and used[2], and enterprise customers can access an audit report that details what and when things are accessed by Google employees[3].
0: https://storage.googleapis.com/gfw-touched-accounts-pdfs/goo...
1: https://workspace.google.com/security/
> Companies which are able to operate without these fears will move faster
Or the fears are real and companies that operate without them will be exploited, or extinguished for annoying their customers.
The company I work for uses Gmail, but we have a business relationship with Google: we pay for Business licenses that include business data handling in the agreement.
If an employee sets up a random Gmail account that is not covered by the agreement, that's a personal account, and sending company data to a personal email account might be grounds for firing that person.
Setting up some account at random with OpenAI and putting in company details like customer names or the like is a data breach.
Companies will let people use the tools, but it's not like one can start setting up random accounts without approval from management. Of course there are different types of companies with less or more red tape.
If your company's code is all repositories on Github (or bitbucket, or any similar service), worrying about ChatGPT is quite silly.
And on the other hand, if your company doesn't use GitHub etc. due to security concerns, that's a very good sign you need to ban ChatGPT too.
No, it's not silly to worry about it. Many companies store data in third party systems to which they retain control over access. Once you put data into chatGPT what control do you have over it?
If a company's or government's risk model allows for giving Google all the most sensitive information, then that says something about trust. It has not gone unnoticed how those risk models differed when something like TikTok arrived, or with earlier Huawei 5G modems.
The enterprise version of Gmail was just an additional step to instill trust. In practice it is still a decision based on trust rather than physics. An "enterprise version of privacy guarantees" for Huawei 5G modems or TikTok apps would not make governments suddenly happy with a risk model where sensitive data has a minor risk of ending up in China.
> We saw these same fears with the release of Gmail. Why would you trust your email to Google?!!
The original Gmail TOS explicitly stated that they scanned the content. They only stopped for the rollout of Gsuite.
IIRC (been a while, so maybe I'm wrong) Google was also the reason Amazon swapped email formats for purchases. They realized they were giving a ton of data to Google through the receipts about products purchased, so now they just give you the vague order emails.
I recommend you take all your proprietary code and copypaste it to ChatGPT. You can help improve our collective generator.
If you’re an artist, just send all your work to DALL-E. Why have money or fame?
Google wasn't yet evil when most people adopted Gmail.
And corporations have strict agreements with their providers. They are even required to in many cases due to GDPR and the likes. Users connecting to ChatGPT on their own accounts bypass this.
> too profitable to not participate in
Sorry, but I really struggle to see how a non AI company will actually become more profitable simply by getting their employees to use ChatGPT. In fact, the more companies that use it, the more demand there will be for "human only" services.
This is the issue with a tool so powerful, you can't just tell people not to use it, or to use it responsibly. Because there's too much incentive for them to use it. If it saves hours of a persons' workday, and they're not seeing any of the harm caused from data leakage, there's no incentive for them to not use it.
Which is why a private option is so critical. Not fighting against human nature means providing the ability to use the tool in a safe way.
There's a dev here who is using ChatGPT extensively in his work. The rest of the team is just waiting for him to get caught and fired. Sharing company data with unapproved external entities is very definitely a firing offense.
Glad I work for a company where the CEO pays for everyones ChatGPT Plus for the devs. If you think your code is special then you're wrong.
But you created a throwaway account specifically to reply in this thread?
Unless your company really has nothing to hide, it's easy to accidentally dump a company secret or an API key in a chat session. Of course if everyone is aware of this and constantly careful then you may be OK.
That's because accounts get shadow banned all the time when people get upset when you point out hard truths.
If you're copy pasting API keys or such into ANYTHING, you probably shouldn't be a programmer to begin with.
It's like people who use root account key/secret credentials in their codebase. It's not AWSs fault you got a large bill or got hacked, its because you're dumb.
I regularly say shit that pisses people off here and I have never been shadow banned. It sounds like your "hard truths" are something other than just "hard truths", and/or you have a persecution complex.
Your Karma is over 7000, if you get downvoted your stuff is still visible.
Getting downvoted to gray isn't a "shadow ban" at all. It is however a signal that others didn't find your comment worthwhile.
if you didn't use throwaway accounts your karma would presumably be much higher?
Nice so avoiding getting shadowbanned on hackernews is fine but avoiding getting sued is petty ?
I posted my OpenAI token into a GitHub issue today thinking I'd just kill it right away, which I did, but there was already an email from OpenAI letting me know that my token had been noticed as public and was thus revoked.
If your code has API keys in it, you have bigger problems than ChatGPT.
My code is "special" in the fact that the act of sharing it can carry civil and criminal liabilities for myself, essentially threatening my well-being and freedom.
Not my place legally or ethically to share code with 3rd parties that I've been paid to read and write.
If the contract says the code is special, then the code is special.
> If you think your code is special then you're wrong
Your code is not special, but customer data may be. Also, some companies need to comply with various certifications, and a proven leak of source code that was put into some third-party tool may be a reason to revoke such a certification. That can cause serious financial harm to a given company, as it can lead to e.g. losing government clients.
This is just the tip of the iceberg.
You are still transferring your business data to an external entity, but on top of it you pay for it.
And if you think that there is no special code then you're wrong.
If you think random snippets of code are special you really don't understand the business you're writing code for. So no, your code is not special, and pasting code snippets is not transferring business data.
> pasting code snippets is not transferring business data.
It literally is exactly that. You don't think the code a business creates is "business data"?
[dead]
What do you know about the code written by the people you are replying to?
Lots of code expresses business strategy that is a competitive advantage / sensitive.
My org has done a risk assessment and accepted the risk of using such tools; arguing there is no risk is short sighted.
You forgot IANAL
That entirely depends on the code. It's not that the code is special, it's that the code can reveal things that are competitive advantages (future plans, etc.)
Code isn't special but what you're working on can be and also it's not your decision if you're not the shareholder.
Your code might not be special, but I think plenty of hedge funds and prop trading firms beg to differ.
Does chatgpt plus collect data for training, or does it have more privacy than the free offering?
Replying to myself, it seems your data is still used, unless you fill in a google form to opt out: https://help.openai.com/en/articles/6950777-chatgpt-plus
ChatGPT & DALL-E (non-API products) are opt-out while their API is opt-in https://help.openai.com/en/articles/7039943-data-usage-for-c...
This Google form requires an organization ID so may not apply to personal GPT+ accounts.
Mine is personal and I just filled out this form with the ID from the docs. Easy, worked fine. Also thanks to the grandparent comment for surfacing this!
If you really care about your company's security, you should report it, otherwise you are just complicit.
Why not talk to the employee first?
This could be valid...but with something as powerful as ChatGPT, if it is providing huge benefits for employee productivity, they are unlikely to dump it based off a co-worker's suggestion. Also, unless managing security is within your roles and responsibilities, this approach would likely turn messy from an interpersonal aspect. Lastly, the security issue has already happened, so if this is truly a security concern, the security team should know that (a) something is already out there (b) this could be a widespread problem in the future.
FWIW I don't think the employee should be fired for this or anything; if anything, a company could embrace these new technological advances and provide training on using ChatGPT in a more secure manner (i.e. don't paste your customers' PII into a prompt, etc...).
> if it is providing huge benefits for employee productivity
With this particular employee, using chatGPT has not increased his productivity or the quality of his work by any noticeable degree.
> I don't think the employee should be fired for this or anything
The problem isn't using the technology. The problem is sharing confidential information with an unapproved entity. That is specifically and clearly spelled out as a firing offense, for pretty obvious reasons.
Even if some people feel that it's an overly tight policy, it's the stated policy, and the company has every right to set and enforce whatever rules it wishes about the use of its own data.
I agree.
If they're more productive by doing it, I think it's an equal chance said dev gets promoted.
He's not more productive, but even if he were, it wouldn't affect his getting fired.
Productivity isn't everything.
Why?
Uploading code to ChatGPT can be done by trainees.
Yeah, but the code coming out of ChatGPT is generally not in a state that you want to commit straight into your repo. Making adjustments (and writing the original prompt) is where your expertise comes in.
You can just submit the trash code and let a senior fix it for you in code review.
> you can't just tell people not to use it
Uh, why can't you tell people not to use it...? If security is that important for your company, of course you can tell your employees which tools to use.
A fun fact: in many areas of TSMC, smart phones are banned. No one says "you can't just tell people not to use smart phones."
That’s very common in a lot of secure places. There are tons of government contractors who have to put smartphones away due to the cameras on them. Not to mention the more secure government places.
> in many areas of TSMC, smart phones are banned
This does not surprise me at all. What I want to know is how they enforce it.
Unless they have something better than "fear of somebody seeing you using the smartphone", it isn't getting enforced. If they do have something better I want to know what.
Metal detector.
No, I'm not joking. One of my high-school classmates works in R&D there. They ask you to pass through a metal detector gate, take away your phone if found, and then give you a company phone for emergency calls only. It's that strict (at least for R&D; probably not for management and others).
At Samsung, you typically walk through multiple sets of metal detectors and security before you can actually get into the fab. Anyone working in an office area can have a phone but they do some... stuff to it.
> Anyone working in an office area can have a phone but they do some... stuff to it.
This is what I'm interested in.
I get it, the guys in bunny suits will probably tolerate being groped and wanded every workday for the rest of their career. I have a hard time believing the scientific staff and executives tolerate that.
> they do some... stuff to it.
They put a special sticker on all of your cameras and inspect if it is still there on the exit.
1.) It's not at all clear it's nearly as powerful as you think. Certainly in my domain--writing about various topics--it's not.
2.) Of course, you can tell people not to use it. Unlike people at SV companies apparently, people in government and government contractors accept restrictions like not having phones in secure labs all the time. Start firing or even prosecuting people and people will discover very quickly they don't really need some tool.
And, yes, private versions of this sort of thing helps a lot.
We published an internal policy for AI tools last week. The basic theme is: "We see the value too, but please don't copypasta our intellectual property until we get a chance to stand up something internal."
We've granted some exceptions to the team responsible for determining how to stand up something internal. Lots of shooting in the dark going on here, so I figured we would need some divulgence of our IP against public tools to gain traction.
Not using ChatGPT is easy, but with things like GitHub or VSCode with Copilot (which is a special version of GPT-3) and, in the future, Copilot X (GPT-4), this will get hard.
One developer opening a folder in VSCode with Copilot enabled aaaaand it’s gone. You never know what part of the folder left your building.
What if you host your code in GitHub? That concern is weird to me, because you already give Microsoft pretty much everything.
You use Windows, VSCode, etc, all of this has access to your code.
Tech is a big place and not everyone uses GitHub.com - they have an entire self-hosted version for exactly that reason, since many customers have policy or legal requirements – but also consider the distinction between something following its stated policy or doing something else. When you use Windows or VSCode the terms of service do not include sending your personal data to someone else and Microsoft would be in serious legal trouble if they changed that. In contrast, Copilot explicitly does have the right to send some of your code elsewhere so the legal question would come down to whether it reached the point that a judge would no longer consider “snippets of your code” to cover what was sent.
If you use GitHub for sensitive internal stuff you will have a contract with them. This is different from users dumping your data into a service of a vendor you have no business relationship with.
And Windows and VS Code don't upload your data to Microsoft unless you choose to do so.
Inform us when you figured out a way to host something with the quality of ChatGPT internally :-)
You can use chatGPT inside Azure, like any other service. It's not the same one used by OpenAI, and there are different guarantees.
> ChatGPT is now available in Azure OpenAI Service
https://azure.microsoft.com/en-us/blog/chatgpt-is-now-availa...
Sorry, but the whole point is to not use a closed-source third-party API with a dubious privacy policy run by a multinational surveillance-capitalism megacorporation.
They're probably already using AWS/GCP/Azure, so trusting Azure is not going to be considered nearly as big a deal as trusting OpenAI.
When you use it via the first-party Azure service, the data you input is de-facto guaranteed to not be fed back into the model, so you don't expand your circle of trust by feeding MSFT with your code.
I can host llama's 7b model internally. It hallucinates more often than not and has a tendency to ramble, but dammit it's local and secure!
What's it like with code, documentation, regex, etc?
That's all I use ChatGPT for. I don't need it to be able to write poetry.
I did not verify this regex, on the very surface scan seems OK:
./main -m ./models/7B/ggml-model-q4_0.bin -t 4 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647 -n 1024 -p $'Here is a handy short form regex to validate an email address: '
Here is a handy short form regex to validate an email address: ^([a-zA-Z0-9_.+-]+@[a-zA-Z0-9.-]+\.[a-z\.]{2,6})+$ The first character can be either uppercase or lower case. The second group must contain at least one letter and the third group may contain any number of characters (up to 5). The last part ensures that it ends with @ followed by two more letters separated by dots. If you want to make sure that your input string contains only valid characters for emails then use this regex instead: \A[\w.]@[\w.]\.\w{1,4}\z
Badly wrong, in part because your prompt was badly wrong, because there is no short regex to validate an email address. (It needs a way of saying “I can’t complete that because the prompt makes no sense for this reason”.)
What it emitted accepts a large number of invalid addresses (due to things like not checking dot placement, and the inexplicable (…)+ wrapping around the entire thing), and doesn’t accept a large number of valid addresses (some comparatively esoteric, like local parts containing any of !#$%&'*/=?^`{|}~ or IP addresses for the domain name, and some very reasonable, like TLDs of more than six characters, or internationalised TLDs even in Punycode form).
The description it emits does not match the regular expression at all well, either.
The second regex it emits is even worse than the first, unnecessarily uses PCRE-specific syntax, and is given with a nonsensical description. (Note: the asterisks got turned into italics, backslash-escape them here on HN. With this fixed, the regex was \A[\w.]*@[\w.]*\.\w{1,4}\z.)
> on the very surface scan seems OK
And there’s the danger of this stuff. As a subject-matter expert on regex and email, I glanced at the regular expression and was immediately appalled (… quite apart from the whole “here we go again, this is certain to be terrible” cringe on the prompt). But it looks plausible enough if you aren’t.
It is a bit crazy to me that someone posts a regex like that without verifying it, says it looks good at a surface level, and implies the whole thing was useful and a good result.
I said it looks ok, not good. My comment is mostly about me being surprised a valid regex came out. I also asked it to write a regex to parse html which it happily answered. What does gpt4 say about parsing html ;)
But it is going to be either useful or harmful: harmful if doing the regex validation itself is worse than doing no validation at all, or worse than a very simple check that an @ is included somewhere.
For comparison GPT-4 provides the following Python regex and then warns that it does not catch all edge cases and that it’s better to use a dedicated library like email-validator:
email_pattern = r"^(?=.{1,256})(?=.{1,64}@.{1,255}$)(?=\S)(?:(?!@)[\w&'+._%-]+(?:(?<!\\)[,;])?)(?<=\S)@((?=\S)(?!-)[A-Za-z0-9-]{1,63}(?<!-)\.?)+[A-Za-z]{2,19}(?<=\S)$"
I would say either that or just have a basic check that there is an @ somewhere.
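For what it's worth, both options are tiny. A rough sketch (the email-validator call is from memory, so treat the exact signature as approximate):

    # Option 1: the bare-minimum sanity check
    def looks_like_email(s: str) -> bool:
        return "@" in s.strip()

    # Option 2: a dedicated library (pip install email-validator)
    from email_validator import validate_email, EmailNotValidError
    try:
        validate_email("user@example.com", check_deliverability=False)
    except EmailNotValidError as err:
        print(err)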
There are much longer top level domains than 6 characters though.
And the second one is confusing me. It seems to be matching a single character only for the initial portion?
Neither of them seem good, and especially the last one.
And the way it describes both seems off as well. I would have to say it brings more harm than good based on that.
This looks like good performance. We are keeping an open mind with regard to actually-open alternatives.
Llama's 7B model hosted internally is, for me, on a totally different level quality-wise. Even when explicitly instructed to not make up stuff and just say 'I don't know', it will still go ahead and ramble and invent things. When I tell it to only use the prompt data it will still invent, or just ignore the prompt data. It's not useful for production (i.e., to be exposed to 'regular' non-AI users). ChatGPT, on the other hand, will listen to those instructions, say when it does not know, and keep to only the prompt data.
One problem with the current way these models are being trained is that they have no idea of what they're saying. It's just a recursive guess the next word type algorithm. I would not expect the confidence levels of any given fragment, let alone an average, to be a meaningful predictor of truth.
Also, completely spitballing, I expect that a big chunk of OpenAI's 'secret sauce' is simple processing layers above and beyond the model. If you input gibberish to llama does it give you an output? If OpenAI is artificially tokenizing inputs (as opposed to just sending inputs straight to the model), it would both dramatically limit the input domain, thus improving output tuning, and give "it" the ability to say when it doesn't know something. I put "it" in quotes since that response would not be coming from the LLM, but from the preprocessing/tokenization system returning an error code in natural language.
I think there's some weak indirect evidence for this in the service itself, since incoherent inputs are instantly rejected, whereas even simple queries take dramatically longer to output even the first word. It's like the input is not even being sent to the LLM software for processing.
This is a very helpful observation.
I've been debating the idea of building tiers or layers of models to accomplish the same.
It very well could be that this go/no-go pre-processor is simply another ML model trained on a binary classification task. Stack a few of these and you can wind up with some interesting programming models.
This would also explain the ease with which ChatGPT gets rid of escapes/bad prompts - they have an additional layer that assesses whether the question could be, for example, racist, and then spits out a 'Sorry, as a language model I am not trained to answer this kind of question'. No need to retrain the main 14B transformer model.
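To make that concrete, here's a minimal sketch of what such a gate might look like; the classifier, threshold, and canned refusal are all hypothetical, not anything OpenAI has documented:

    # Hypothetical two-stage pipeline: a cheap go/no-go classifier runs
    # before the prompt ever reaches the expensive generative model.
    def p_disallowed(prompt: str) -> float:
        # Stand-in for a small binary classifier returning P(disallowed).
        blocklist = ["make a bomb", "credit card dump"]  # toy stand-in for a trained model
        return 1.0 if any(b in prompt.lower() for b in blocklist) else 0.0

    def call_main_model(prompt: str) -> str:
        return "<completion from the large model>"  # placeholder

    def answer(prompt: str) -> str:
        if p_disallowed(prompt) > 0.5:
            return "Sorry, as a language model I am not trained to answer this kind of question."
        return call_main_model(prompt)  # only prompts that pass the gate reach the big transformer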
Even if we had a 100% private ChatGPT instance, it wouldn't fully cover our internal use case.
There is way more context to our business than can fit in 4/8/32k tokens. Even if we could fit the 32k token budget, it would be very expensive to run like this 24/7. Fine-tuning a base model is the only practical/affordable path for us.
You can retrieve information on demand based on what the user is asking, like this: https://github.com/openai/chatgpt-retrieval-plugin
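The pattern behind that plugin is simple enough to sketch: embed the question, pull the top-k most similar internal documents from a vector store, and paste only those into the prompt. A rough sketch follows; embed() here is a dummy stand-in (a real setup would call an embedding model), and none of this is the plugin's actual API:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Dummy deterministic embedding so the sketch runs; swap in a real encoder.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(384)

    docs = ["refund policy: ...", "deployment runbook: ..."]  # internal documents
    doc_vecs = np.stack([embed(d) for d in docs])

    def retrieve(question: str, k: int = 2) -> list:
        q = embed(question)
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        return [docs[i] for i in np.argsort(-sims)[:k]]

    prompt = "Answer using only this context:\n" + "\n".join(retrieve("How do refunds work?"))

Only the retrieved snippets cross the trust boundary, not the whole corpus, and the context stays within the token budget.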
Just use the API? It deletes your data after 30 days...
Our policy for now is the exact same.
The problem is the verification of this of course.
I think there's more fear of OpenAI leaking data than say, Airtable or Notion or Github or AWS/S3 or Cloudflare or Vercel or some other company that has gobs of a company's data. Microsoft also has gobs of data: anything on Office and Outlook is your company data — but the fear that they'll leak (intentional or accidental) is somehow more contained.
If we want to be intellectually honest with ourselves, we can either be fearful and have a plan to contain data from ALL of these companies, OR, we address the risk of data leaks through bugs as an equal threat. OpenAI uses Azure behind the scenes, so it'll be as solid (or not solid) as most other cloud-based tools IMO.
As for your data training their data: OpenAI is mostly a Microsoft company now. Most companies use Microsoft for documentation, code, communications, etc. If Microsoft wanted to train on your data, they have all the corporate data in the world. They would (or already could!) train on it.
If there's a fear that OpenAI will train their model on your data submitted through their silly textbox toy, but NOT through training on the troves of private corporate data, then that fear is unwarranted too.
This is where OpenAI should just offer a "corporate" tier: charge more for it, make it HIPAA/SOC 2/whatever compliant, and use that to assuage the fears of corporate customers.
> I think there's more fear of OpenAI leaking data than say, Airtable or Notion or Github or AWS/S3 or Cloudflare or Vercel or some other company that has gobs of a company's data.
There is zero fear. OpenAI openly writes that they are going to use ChatGPT chats for training, on the popup modal they show you when you load the page. That is not a fear; that is a promise that they will leak whatever you tell them.
If I tell you “please give me a dollar, but I warn you, you will never see it again”, would you describe your feeling about the transaction as “fearful that the loan won’t be repaid”?
1. Azure has the worst security of the major cloud providers; multiple insanely terrible RCE and open readable DB exposures.
2. Azure infrastructure still likely has far better security/privacy by virtue of all their compliance, (HIPAA, FedRAMP, ISO certifications etc.) than whatever startup-move-fast-ignore-compliance crap OpenAI layers on top of it in their application layer.
For most of those tools you can get your own self-hosted version if you’re worried about your data.
Not only do you have to worry about employees directly sharing data, but many companies are also just wrappers around GPT. Or they may use your data in the future to roll out new AI services.
While this is not a new problem -- employees share sensitive data with Google all the time -- the data leakage will be more clear than ever. With ads-based tracking and Google search, the leakage was very indirect. With generative AI, it can literally regurgitate memorized documents.
The security risk goes beyond data exfiltration. Folks are already trying to teach the AI incorrect information by spamming it with something like 2+2 = 5.
Data exfiltration + incorrect data injection are super underrated risks to mass adoption of generative AI tech in the B2B world...
No one cares about security because there is no consequence for getting it wrong. Look at all the major breaches ever. And look specifically at the stock price of those companies. They took small short term hits at best.
Worst case the CISO gets fired and then they all play musical chairs and end up in new roles.
Heck, even Lastpass, ostensibly a security company, doesn't seem particularly affected by their breach.
My point is, especially with ChatGPT, where it can reasonably 10x your productivity, most people will be willing to take the risk.
I don't even agree with the premise that a tool which vastly increases your productivity is a security risk in the grand scheme of things, since, you know, you can just allocate the time you were spending on writing boilerplate towards securing systems.
Endpoint security software is a security risk too.
Yikes! Aside from the fact that nobody will take the "spare" time saved and spend it on internal security, once you have a lot of your sensitive data outside in a third party system you have lost control.
> nobody will take the "spare" time saved and spend it on internal security
Well, I have, so speak for yourself.
> once you have a lot of your sensitive data outside in a third party system you have lost control.
Every time you search “how do I do this with this software stack” in Google you are leaking data which is nominally sensitive to a third party system. Every time a technical staff member goes to Stack Overflow without obfuscating their IP address they are leaking sensitive data about what software stacks a company uses. Let’s not even get into people posting their resumes on LinkedIn, or cloud services in general.
The goal of security is not to stop all data leakage, it’s to stop the leakage of certain high value data, and LLMs can aid in this end if you use them intelligently and avoid leaking any high value data to them and only feed them with low value data. Attackers are not going to have any qualms about using LLMs both to come up with attacks and as part of attacks. Many many people are in situations where using LLMs to advance in security maturity as quickly as possible is more than worth the risk incurred. Don’t win the battle, win the war.
Ok, fair point. You are the one person who saved time and used it for security. But most people won't do that, as I'm fairly sure you will agree.
There is obviously a very big gap between searching for information and providing your internal code or documents to a third party. One reveals only your search terms, the other gives an attacker your actual proprietary information.
LLMs are ground breaking and have enormous potential. I am not saying that they should not be used. Only that there are huge security issues when employees of most companies post confidential information to third parties.
We went pretty quickly from:
No way I’m giving Google any of my data! I will use 5 different browsers in incognito mode and never log in.
To ->
Sure I will login with my name and email and feed you as much of my most personal thoughts and data as I can dear ChatGPT!
Any examples where those two were the same person?
Because both types of people have always existed. Heck, lack of vigilance among the ancient Greeks is what put the Trojan in Trojan horse.
It was a lack of vigilance among the Trojans, actually. The Greeks did the burning and pillaging. So, uh, OpenAI is the Greeks, ChatGPT is the horse, and Microsoft is King Menelaos or something. Achilles is dead, but he did make TempleOS.
I bet you find some on HN.
Sometimes curiosity beats caution.
Yes, without thinking I put a lot of data into Bing Chat, and then I realized what I had done :(
My problem with Google is they'll ban me from gmail for something I do on youtube.
Not quite. These are the same people that use only Chrome while being logged in to their Google account. Convenience wins.
> These are the same people that use only Chrome while being logged in to their Google account
This situation is much dumber than that. ChatGPT is very clear that you shouldn't give it private data and that anything you type into it can/will be used for training.
Google is nowhere near that level of transparency.
Google takes your data and sells it. Literally making your data available to the highest bidder. Is OpenAI doing that? If Google existed in its current form during the early internet it would be classified in the same category as BonziBuddy. Spyware. That is what Google is. So I can very reasonably understand why people would trust OpenAI with data they wouldn't trust Google with. OpenAI hasn't spit in the face of its users yet.
> Google takes your data and sells it. Literally making your data available to the highest bidder.
But it doesn't, does it? It sells the fact that it knows everything about everyone and can get any ad to the perfect people for it. It's not going on the open market and telling people I regularly buy 12 lbs of marshmallow fluff and then use it in videos I keep on my google drive.
> Google takes your data and sells it. Literally making your data available to the highest bidder.
No, Google takes money to present ads to people of different demographics, and uses your data to do that. It doesn’t sell your data, which is, in fact their competitive edge in ads – selling your data would be selling the cow when they’d prefer to sell the milk.
Not really. Even the most evil Google one can imagine would realise "your data" is the most valuable thing they possess, selling it would be bad for business. They're selling ads to the highest bidder who's looking for someone with a profile based on your data, but not your data itself.
True, but that's not actually any better. And it still counts as selling your data, just indirectly.
It's completely different in ways that matter to me.
> I can very reasonably understand why people would trust OpenAI with data they wouldn't trust Google with. OpenAI hasn't spit in the face of its users yet.
In other words, having been burnt once by touching a flame, the conclusion these people draw is that the problem was with that particular flame and they're fine with reaching for a different one?
> Google takes your data and sells it. Literally making your data available to the highest bidder.
Even if they are not doing it now(?), what makes you think that they will not do so in the future? It's not like your data has an expiration date.
Because they are completely different business models. If OpenAI decides to become an advertising behemoth then I would show concern. Right now they use your data for training (when they use it).
OpenAI is selling others data in their model responses. Selling others data is their main business model.
If they use user data to train their models, other users could ask "Show me the code for Gmail spam filters", and if it was trained on engineers refactoring that spam filter in ChatGPT, chances are it would give you the code. If that doesn't count as "selling user data" I don't know what does. They not only sell it, they nicely package and rewrite it to make it easy to avoid copyright claims!
OpenAI has already demonstrated that they're all in for maximizing profit. They may not be advertisers, but advertisers aren't the only sorts of companies that make bank by selling personal data.
I see no reason to think OpenAI would leave that money on the table.
This is nothing new at all. How many people have Grammarly plugins installed? They are advertising aggressively, so I'd think it is the new hotness. Don't tell me Grammarly is not hoovering up all of the Slack, Word, Docs, and Gmail data that everyone sends it, and holding on for some future purpose. We'll see.
Grammarly is an OpSec nightmare that’s somehow managed to slip under most people’s radar. I know folks who won’t use the Okta browser extension because of the extensive permissions it asks for, but will happily use Grammarly on everything.
Last time I looked (maybe things have improved…?) Grammarly would automatically attach itself to any text box you interacted with, and immediately send all of your content to their servers for processing. How this software gets past IT departments is a mystery to me.
2017... "Grammarly Vulnerability Allows Attackers To See Sensitive Data of Their Customers" - https://www.invicti.com/blog/web-security/grammarly-vulnerab...
This is one of the reasons Databricks created Dolly, a slim LLM that unlocks the magic of ChatGPT. A homegrown LLM that can tap into and query all the data in an organization's Data Lakehouse will be hugely powerful.
I am working with customers that are looking to train a homegrown LLM that they host and have blocked access to ChatGPT.
https://www.datanami.com/2023/03/24/databricks-bucks-the-her...
This reads like you had an LLM write an ad for you
This is a user led data leak that ranks up there with Facebook and LinkedIn asking for email passwords to “look for your contacts to add”.
In my experience most corporate employees just take the path of least resistance. It is not uncommon for people to paste non-public data into websites just to do JSON formatting, and to paste Base64 strings into random websites just to decode them. So just telling people not to do something won't accomplish much. Most corporate employees also somehow think they know better than the policy.
Any company that doesn't want to feed data into ChatGPT needs to proactively block both ChatGPT and any website serving as a wrapper over it.
A while back I got to hear about how the IT team running my then-employer's internal time reporting tool was sending all the usage data through Google Analytics and how neat that was for them to look at :\ .
I shudder to think what they are doing now.
I am not sure I see the issue here?
> and any website serving as a wrapper over it
I agree this would be a good move, but it's going to be harder and harder to do that definitively.
Perhaps we need an “LLM Block” browser extension? :)
I believe there were FUD pieces like this when internet search engines were rolled out, and again when social media became popular. I suppose its universal for new technologies.
I had an interview a while ago at a place where, during the phone screen, "they can't talk about their tech stack in detail," so I looked on LinkedIn and figured out their entire tech stack before the onsite interview. Come on guys, according to LinkedIn you have an entire department of people doing AWS with Terraform and Ansible; you don't have to pretend you can't say it in public.
ChatGPT Business Edition seems pretty obvious and I'd be surprised if OpenAI isn't already working on it. Separate models for each customer, data silos and protection. The infra is already there on Azure.
It actually is on Azure, exactly as you described.
https://learn.microsoft.com/en-us/azure/cognitive-services/o...
Yep, they just need to provide a business specific frontend chat UI.
MS already announced that. They call it Business Chat. https://blogs.microsoft.com/blog/2023/03/16/introducing-micr...
If it works well it will be a big deal.
For fun I once just made a blank form with a submission button and a giant text field.
It was quite amazing what people would submit unprompted, so I'm not at all surprised that people would feed sensitive data into ChatGPT. The next cycle will be that ChatGPT gets - surprise - trained on that data, and may start using fragments of it - which may well still be sensitive enough to cause trouble - as its output.
Don't paste confidential information into a textbox, in fact don't trust anybody or any company with your confidential information unless there is a strong contractual relationship backed up by penalties if it gets broken. And even then: the ultimate responsibility is yours, you may be able to recover some $ for damages but your reputation may well be toast.
When it first came out and my boss was beside himself about how cool it was, he was feeding it all of his emails with other businesses to have it clean them up. Boggled my mind.
do those other businesses use gmail? does your company?
I think those are different models.
Gmail has a vested interest in keeping any knowledge it gains about you secret - its competitive advantage is knowing more about you than anyone else does.
ChatGPT's strength is its ability to clearly communicate the knowledge it has (including training data it gains from people it interacts with) to give you good responses.
I am still not seeing a huge threat, to be honest. This is not how attacks are done. OpenAI also has a vested interest in keeping your data safe, and is strongly linked to Microsoft.
Most corps, companies store a lot of internal data with corps like Google, Microsoft, Amazon and others.
Meanwhile over at Github Copilot...
Hahahahahahaha
If you're using Github already then Copilot isn't seeing anything new.
Correct, but that level of security is expected from GitHub proper, they have all sorts of independent security reviews for their partners. Does all of that exist for Copilot?
Do you think Microsoft would dare to have Copilot with any less standards?
Counter: it already vomits all kinds of licensing issues everywhere, which somehow they didn’t really see coming…so yes?
No it doesn’t.
Given widespread stories of how Copilot development was done by a skeleton team (like 6 people at launch), yes, absolutely.
Yes
Exactly why our product Codeium (Copilot alternative) supports self-hosting.
"In one case, an executive cut and pasted the firm's 2023 strategy document into ChatGPT and asked it to create a PowerPoint deck."
There's really not much you can do here. This is complete lack of very basic common sense. Having someone like this in your business, particularly at the executive level, is a liability regardless of ChatGPT.
Given that they use all the labor of the Internet without attribution, we should assume that they will use every additional drop of data we give to them for their own ends.
This is what I hate about it.
This is scary, but it doesn't surprise me even in the slightest. ChatGPT is useful for so many things that it's extremely tempting to convince yourself that you should trust it.
For example, I was having some issues with my LTO-6 drive recently, and I had to finagle through a bunch of arcane server logs to diagnose it. I had the idea of simply copypasting the logs into ChatGPT and having it look at them, and it quickly summarized the logs and told me what things to look for. It didn't directly solve the problem, but it made the logs 100x more digestible and I was able to figure out my problem. It made a problem that probably would have taken 2-3 hours of Googling take about 20 minutes of finagling.
I'm not doing anything terribly interesting or proprietary on my home server, so I didn't really have any reservations sharing dmesg logs with it, but obviously that might not be the case in a company. Server logs can often have a ton of data that could be useful for a competitor (whether it should be there or not), and someone not paying attention to what they're pasting into ChatGPT could easily expose that data.
This was my first concern when it came to IDE plugins.
It's alright, I just told it that I don't consent to my data being used. Checkmate openAI!
OpenAI. The heist of the century. I am waiting for an A.I.-generated blockbuster in the near future.
I was just watching Altman's interview from the Lex Fridman podcast a few days ago.
It really does feel like YC was a plot to fund the harvesting of it all, with an OpenAI climax. Not a serious conclusion; it's just funny to watch it unfold, as if nobody even cares about the optics.
This cycle happens regularly and it seems often times the service provider wises up and charges for extra controls.
Yammer pre-Microsoft and nowadays Blind — lots of “insider” information seemingly posted.
As usage goes up the target size, and opportunity cost, both go up.
I’m curious if anyone’s employer has set up their own LLM. My employer has a couple of A100 sitting around which could easily host a couple instance of 65B LLaMA or Alpaca. Convincing upper management to allow me is the hard part.
I just run it on my desktop? 64GB of DDR4 is <$150.
I'm assuming a quantized version?
The 65B quantized model fits in 64GB of RAM, which I already had.
Though RDIMMs on eBay are even cheaper than UDIMMs (just over $1/GB) and Broadwell-era Xeon workstations aren't that expensive if you want to run the unquantized version.
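Rough back-of-the-envelope arithmetic (mine): 65B parameters at roughly 4.5 bits per weight for q4_0 is about 65e9 × 4.5 / 8 ≈ 37 GB of weights, which is why it squeezes into 64 GB with room left for the KV cache and the OS.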
What on earth do you need to convince them, apart from *gestures at all this*?
Team up with someone from sales, marketing :)
Funnily enough I wrote a cautionary comment on this just 2 days ago :
Let’s not forget that we’re also feeding in all our code into OpenAI Codex.
Many people like me like to paste stuff into an editor to strip the formatting.
And GitHub Copilot will send that tiny amount of data off to who knows where.
Which I assume also means sensitive code that may be .gitignore'd but is still being pushed up to OpenAI, i.e. secrets, passwords, API keys.
My understanding of GPT is that the only vector for your data to get "into the model" is if it's used in fine-tuning/RLHF. My guess is that if you do the thumbs up or thumbs down, the session probably will be, but otherwise probably not. I still wouldn't put in private employer data, primarily because of the other exposure risks - it's obviously not stored securely on the OpenAI side. But besides typical IT risk, the big unknown is whether or not the model will spit out what you put into it in somebody else's session, and my understanding is that's only possible if your conversation is used for RLHF.
I guess another way to say that is, OpenAI (or another service provider with a better security track record) could broker this service in the cloud, with guarantees around not using the session data for RLHF, not storing session data, stronger auth (OpenAI has had a couple of incidents that show that they have pretty lax security in their backend), etc. and could make a killing selling or re-selling ChatGPT to businesses.
Seems like a temporary problem. Surely OpenAI will have a version which runs in a customers public cloud VPC, orchestrated by OpenAI.
First thing that went through my mind as I read the headline was Zuckerberg comments on people posting their info on Facebook.
Yep. I've been playing with it only in a Stack Overflow way for this reason. I would certainly not mention my personal proclivities or ask it to generate a new password for me. That might be appropriate for a locally hosted model, depending on how secure you can keep it.
Good point. And why don't you just unplug your computer and do all your stuff with paper and pen? I believe it's the most secure way to protect your data. Remember to store these papers in your basement and bring your key.
Could there be eg a browser extension for scrubbing sensitive data from input on paste?
I'm hoping OpenAI will implement something like this on their end soon, like data monitoring apps (Sentry etc.) do, but if they don't, client-side is an option.
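The core of such a scrubber is just pattern redaction before the text leaves the machine. A toy sketch of the idea (in Python for brevity; a real browser extension would do the same thing in JS, and these patterns are illustrative, not exhaustive):

    import re

    PATTERNS = {
        "EMAIL":   re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "AWS_KEY": re.compile(r"AKIA[0-9A-Z]{16}"),
        "SSN":     re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def scrub(text: str) -> str:
        for label, pat in PATTERNS.items():
            text = pat.sub(f"[{label} REDACTED]", text)
        return text

    print(scrub("Contact jane@corp.com, key AKIAABCDEFGHIJKLMNOP"))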
This should be treated like any other data breach and employees should be reprimanded accordingly, especially for breaches of classified data (not sure if that ever happened, but I’m sure some dumbass has tried it).
How do you feed sensitive data to ChatGPT? I've got no account, therefore no idea. I thought you just ask it questions, or task it to write some code. How does the sensitive data get there?
- Translate this upcoming announcement in Spanish
- Can you summarize that product roadmap?
- Can you add comments to this piece of (proprietary) code?
- Please extract the birth dates from this employee list
I could go on and on and on
I see. Thanks.
You know what's hilarious about this - they quote specific examples of information that was prevented from being sent to ChatGPT. Someone at Cyberhaven is reviewing the requests it blocked.
Wouldn't it be trivial to add a "read-only" mode to the LLM's operation, where it uses stored knowledge to answer queries but doesn't ingest new knowledge from those queries?
From https://help.openai.com/en/articles/7039943-data-usage-for-c...:
> You can request to opt out of having your content used to improve our services at any time by filling out this form (https://docs.google.com/forms/d/1t2y-arKhcjlKc1I5ohl9Gb16t6S...). This opt out will apply on a going-forward basis only.
It goes to a Google form, which is I guess better than them building their own survey platform from scratch that may have more vulnerabilities.
> them building their own survey platform from scratch
Funny. This Nobel Prize winner raises an interesting question:
If your AI is so great at coding, why is your software so buggy?: https://paulromer.net/openai-bug/
I'd be worried this is also the "how to get banned from OpenAI in the near future" form. And if OpenAI retains a monopoly like Google does for search, you are basically screwed.
I'm old enough to remember when newspapers reported hackers being banned from using computers.
And IP pirates banned from using the internet. Actually, that one I remember: I voted against my local MP after they passed a law to make that the norm.
We don't yet have a social, let alone legal, norm for antisocial use of LLMs; even with social media, government rules are playing catch-up with terms of service, and Facebook is old enough that if it was human it could now vote.
So, yes, likewise computers/internet/social media, being banned from an LLM if it's a monopoly is going to seriously harm people.
But that is likely to be a big "if". The architecture and the core training data for GPT-3 aren't a secret, and that's already pretty impressive even if lesser than 3.5 and 4.
No, everyone will have GPT-4 level AI in 6-12 months.
Why isn't it read-only by default? It's not even connected to the internet.
ChatGPT, and I think all the GPT LLMs, is only accessible over the internet as far as I can tell.
And the thumbs up/down are there on the chat interface because it's partly trained by reinforcement from human feedback.
Nope. LLMs don't use the internet for inference at all unless you give them access to a web search API or something like that. ChatGPT is just too massive to run on any local machine, but make no mistake, it does not require the internet.
I didn't say "for inference", and neither did the person I replied to.
GPT uses the internet to connect to users, but rather more importantly chatGPT in particular has a layer on top of GPT which is trained from human feedback.
Keywords search "RLHF".
That feedback mechanism is, if anything, becoming more detailed as time passes, so I must infer that it's still considered highly important, probably even for the 3.5 model.
The model isn't being trained as time goes on.
The RLHF layer is, and that layer is important.
That it does any of this is the specific reason for the story you're commenting on, and why putting data into it isn't like putting the same data into e.g. a Google Docs spreadsheet.
What’s not clear from OpenAI is what they actually store.
Clearly they store information from ChatGPT, but if I hit their chat/completions API with the same exact question do they store that as well?
Why don’t people just copypaste all their proprietary code to ChatGPT lol. And maybe send all their artworks to DALL-E. Then good luck trying to sue them for remixing your stuff
The wild thing is that you can just annotate the content with the associated organization information then ask GPT to tell you what it sees in the logs.
Employees have been feeding sensitive data to online translators in the same way.
Not to say it isn't a problem, but it's not an AI/chatGPT specific one.
Looking forward to AI on chip, not too far down the road. As long as we can only use the model through someone else's API, we can't really use it for much.
Are we similarly concerned about people posting company secrets on the Korean website/app "Blind" on a daily basis?
The real innovation of chatgpt will be how you go about getting permission to use it from your corpo middle management overlords.
I noticed this too, which is why I'm working on a startup that lets you train on your data internally through a self-hosted OSS LLM product. It's going to start off with just reading emails for now, until we can integrate with other products too, like spreadsheets, Airtable, databases, etc. Basically an OSS version of OpenAI ChatGPT plugins, I suppose. If you're interested, message me at the email in my profile.
I bet they post sensitive data to any platform which may help them.
Also, I guess chatting over Microsoft Teams or Slack is OK, then.
Just because your code is art, of course, which should be weighed in gold and copyrighted for the next 2000 years while living in a cold wallet. Did you get the memo that humans implemented all the faulty security of the past decades by accident?
Apple will make a killing if they can deploy private LLMs to their M-series chips.
And what percent of employees send sensitive information via unencrypted email?
If you think this is bad, imagine what JSONLint has seen over the years.
It seems like an enterprise product is going to have a market here.
That's why privacy is important in the real world
Imagine if everyone knew your inner secrets just by looking at you at the bar..
Why is it different online? I have no idea.. well, I kinda know, but.. oh well.. we deserve it, I guess.
All information should be free, chuds.
Test
Classic fearmongering article targeted to HN crowd.
i) In this world, there are very few people whose private conversation is worth anything to anybody (celebrities, journalists -- so around 10,000 people)
ii) A tiny tiny %age of information is truly secret (mostly private keys).
iii) Business strategies are mostly a result of execution, not any 'trade secrets'. Meta will succeed because it has executed its metaverse strategy, not because they kept the metaverse strategy secret.
People who take risks and don't care about irrelevant details (just like how they took risks with internet shopping, cloud, SaaS) will win. Losers, like the ones who thought AWS would steal their data, will be left behind.