Suppose you have two classifiers, A and B, and some un-annotated data, D. You want to know how good classifier B is at annotating the data, compared to classifier A.
One problem is that you don't have the ground truth for D. So you start by annotating D with the labels assigned by a third classifier, C:
C(D) → D₁
Having thus established a modicum of "ground truth", ish, you proceed to annotate D with the two classifiers you are comparing, A and B:
A(D) → D₂
B(D) → D₃
Then you compare the classifications D₂ and D₃ to D₁, and find that D₃ better approximates D₁ than D₂. What is the result of the experiment? I summarise it as follows:
Classifier B better approximates the labelling of D by classifier C than
classifier A.
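(To make the setup concrete, a minimal sketch of this comparison in plain Python, with toy label lists standing in for D₁, D₂ and D₃ and simple percent agreement as the metric; this is not the paper's code, it just restates the argument above.)

```python
# Toy illustration of the experiment described above (not the paper's code):
# compare A(D) and B(D) against C(D) and see which agrees more often.

def agreement(x, y):
    """Fraction of items on which two labelings agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

d1 = ["pos", "neg", "neg", "pos", "neu"]  # C(D), the "ground truth"-ish labels
d2 = ["pos", "pos", "neg", "neg", "neu"]  # A(D)
d3 = ["pos", "neg", "neg", "pos", "pos"]  # B(D)

print("A vs C:", agreement(d2, d1))  # 0.6
print("B vs C:", agreement(d3, d1))  # 0.8 -> "B better approximates C's labelling of D"
```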
Now we can name the three classifiers as they were used in the experiments in the linked article:
A: Human annotators.
B: ChatGPT
C: Human annotators.
So the result of the paper is that, plugging in the names:
ChatGPT better approximates the labelling of D by Human annotators than
Human annotators.
And that is the finding of the paper. Which is clearly absurd, and a cause to re-think the methodology.
Ah yes, ye olde "all humans are equally capable of all tasks" axiom (the paper is correct and the parent comment is wrong, perhaps obviously)
To be clear, that's a different criticism of the article's methodology than mine, yes? If you assume that the two sets of human annotators are different, then what are you comparing, exactly? The ability of one group to second-guess the other?
There are other issues if you choose to assume that the two groups are fundamentally dissimilar: one group was two grad students, the other a number of Mechanical Turks. You can expect there to be more disagreement between (more than two? I'm not sure) Mechanical Turks than between two grad students (both in political science).
Ultimately, my problem with the study is that the labelling they took as ground truth (i.e. that they compared ChatGPT and Mechanical Turks against) is too uncertain, or even subjective, to know for sure what exactly they found out.
Edit: Oh, wait, I didn't put that in the comment above. Damn. I thought I had.
In the framing of your original comment:
A: MTurkers
B: ChatGPT
C: Experts
=>
ChatGPT better approximates the labelling of D by experts than mTurkers
Which is a coherent and interesting conclusion.
Edit: also, please forgive the snark in my first response
>> Edit: also, please forgive the snark in my first response
No need to apologise! Your snark wasn't overboard, I thought. Anyway, big girl, can take it :)
>> ChatGPT better approximates the labelling of D by experts than mTurkers
That could be a "coherent and interesting conclusion" but it's not what the article really claims. The article's title is I think hedging its bets, by being very precise about who, exactly, was outperformed by ChatGPT, although it still manages to be vague about how ChatGPT outperformed the Mechanical Turks.
I'm also really doubtful that two political science graduate students can be considered as "experts" in the annotation tasks they were called to perform, which were, again if I got that right, about content moderation. "Experts" in this setting would be people with experience in moderating discussion boards etc. I don't see that this was the case with the two students that provided the initial annotation.
I think your assumption that all humans are equally capable should be reconsidered.
That seems to be the authors' assumption.
The comparison in the paper is who approximates trained annotators better, MTurk or ChatGPT. Trained annotators are the gold standard.
The "gold standard" in the article is two graduate students. The Mechanical Turks are probably more- I can't find the number in the paper. Given the "inter-coder agreement" is low for the Mechanical Turks (0.17% Pearson cor. coeff) it is no surprise that the "gold standard" and the Mechanical Turks' decisions diverge. So the comparison is pointless.
Using grad students to label data is itself the de facto gold standard in text labeling, and whether that makes sense really depends on the difficulty of the task. For example, if your labeling task is to transcribe a bunch of recordings of spoken American English, a grad student is a nice shorthand for "native American English speaker who is reasonably literate", which is a good baseline. On MTurk, by contrast, many English speakers come from countries which speak English with different accents and idioms, maybe as a second language as well, so I would expect them to have a harder time; their computer setup is likely worse, their pay isn't as high, etc.
It seems to me that, A, at least for some tasks, grad-students-as-gold-standard is not wacky, and B, grad students won't necessarily have the same output as MTurk workers. Given these two points, it's perfectly reasonable to ask how consistent-with-grad-students MTurk workers are, seeing as the reason people use MTurk in similar scenarios is as a cheaper alternative to grad students.
Re: inter-coder agreement, I think that the low inter-coder agreement for MTurk is in itself potentially a surprising aspect. Perhaps the explanations that worked well for grad students (and ChatGPT) didn't explain the task properly to MTurks. That's pretty much a point in favor of ChatGPT, though maybe it can be viewed as a point against the actual tasks they use (ie if the explanation couldn't get people of a likely more dissimilar background to output the same results as grad students then maybe the task is less objective than deemed by the task's authors).
Side-notes:
- I looked for the 0.17% number in the paper, and what the paper actually says is that the Pearson correlation coefficient between inter-coder agreement and classification accuracy is 0.17 (= positive but weak). This isn't a number comparing the correlation between different coders. (Frankly, I think comparing ChatGPT's self-consistency to consistency between different people is unenlightening.)
- Section 4, sub-section "Crowd-Workers Annotation", says that each tweet was classified by two different workers, and no single worker classified more than 20% of the dataset.
>> I looked for the 0.17% number in the paper, and what the paper actually says is that the Pearson correlation coefficient between inter-coder agreement and classification accuracy is 0.17 (= positive but weak). This isn't a number comparing the correlation between different coders.
You're right, I got confused. Quoting from the paper:
>> The relationship between intercoder agreement and accuracy is positive, but weak (Pearson’s correlation coefficient: 0.17).
I thought they used agreement between Mechanical Turks to evaluate their annotations. Thanks for the kind correction.
Regarding grad students, well, most grad students at my (UK) university (Imperial College) are mainly not native English speakers. But that's beside the point, I think. My problem is that they took two grad students and compared them to a number of Mechanical Turks. That's just asking for a difference in disagreement between the two groups of annotators, especially if the grad students could communicate with each other and the Mechanical Turks could not (which I suspect was the case).
Anyway that's a different criticism than the one in my post above, but I think also useful to keep in mind.
The comparison is useless because it does not consider motivation. The MTurk economy values volume at the expense of accuracy. The economic claim is nothing new and nothing unexpected: computers, AI or not, are faster and cheaper than humans at any well-defined task. Emphasis on the well-defined.
They showed that, at least for their tasks, their definition of the task was well-defined-enough for ChatGPT. That's exactly why the comparison is useful.
MTurk is often used in these tasks in place of more expensive human annotators (e.g. grad students), and this paper says that for their case, at least, ChatGPT worked better, in the sense that, given the exact same instructions, it gave answers closer to the more-expensive annotators. Using MTurk seriously often entails extra steps intended to verify motivation, e.g. adding a question that says "select option 7" to make sure the person isn't just making random choices, or gathering more answers for questions where there was disagreement between annotators. What these extra steps have in common is that they take both more time when designing the labeling process, and cost more money.
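(For illustration, a rough sketch of those extra QA steps, assuming responses arrive as simple dicts; the field names and the `q_check` attention question are made up.)

```python
# Rough sketch of the extra MTurk QA steps described above: drop workers who
# fail the "select option 7" style attention check, then flag items where the
# remaining annotators still disagree. Field names here are made up.

ATTENTION_FIELD = "q_check"
EXPECTED_ANSWER = 7

def filter_and_flag(responses):
    """responses: dicts like {"worker": ..., "item": ..., "label": ..., "q_check": ...}"""
    passed = [r for r in responses if r[ATTENTION_FIELD] == EXPECTED_ANSWER]

    labels_by_item = {}
    for r in passed:
        labels_by_item.setdefault(r["item"], set()).add(r["label"])

    # Items with more than one distinct label go back for more answers or adjudication.
    needs_review = [item for item, labels in labels_by_item.items() if len(labels) > 1]
    return passed, needs_review
```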
I didn’t check the paper’s methodology, but if humans and ChatGPT classify the same on average but ChatGPT is more consistent, then humans will match ChatGPT better than they match other humans.
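(A quick simulation of that point, with made-up numbers: each annotator picks the item's "modal" label with some probability and a random other label otherwise, so the only difference between the human and ChatGPT here is consistency.)

```python
# Simulation of the point above: same behaviour on average, different
# consistency. A more consistent annotator agrees with a noisy human more
# often than two noisy humans agree with each other. Numbers are illustrative.
import random

LABELS = ["pos", "neg", "neu"]

def annotate(p_modal, modal):
    """Pick the item's modal label with probability p_modal, else another label."""
    if random.random() < p_modal:
        return modal
    return random.choice([l for l in LABELS if l != modal])

def mean_agreement(p_x, p_y, trials=100_000):
    hits = 0
    for _ in range(trials):
        modal = random.choice(LABELS)
        hits += annotate(p_x, modal) == annotate(p_y, modal)
    return hits / trials

print("human vs human:  ", mean_agreement(0.6, 0.6))   # ~0.44
print("human vs ChatGPT:", mean_agreement(0.6, 0.95))  # ~0.58
```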
Your comment is based on a very strong assumption that all human annotators are alike in their motivations and abilities to do the tasks. As someone who has used mTurk in the past quite a lot, I think this assumption is wrong. That was the reason I stopped using mTurk.
On a separate note, how do you type those arrows and subscripts in an HN comment?
This is true. I set up a data validation project with mTurk in the past to validate scraped data by presenting a simple survey of the results. Basically "go to this webpage. The title is X (True or False). The description is Y (True or False)" etc. There were several users who would speedrun the surveys, which created a lot of false positives.
mTurk has some ways to make results more accurate, such as being able to programmatically tank users who have given bad results and making it easy to give bonuses to users who give good results. But those have to be carefully designed within the application
Yes, that's a lot of work. I remember in one image annotation task we got back what looked like a uniform distribution of responses. This was a pilot to determine the quality of responses, so we did not waste a lot of money on it. The more limits we use, the smaller the pool of mTurkers becomes, making it impractical for large tasks.
>> On a separate note, how do you type those arrows and subscripts in an HN comment?
I type in gvim. It lets you type "digraphs" with Ctrl + K and another couple of keys. For the right-arrow it's (in insert mode and without typing spaces):
Ctrl K - >
In gvim, you can see all the digraphs with :digraph (in command mode). The digraph you want to type has to be supported by the font you use, of course.
gvim shows you the available digraphs in a scratch buffer that's not terribly easy to search (there's lots of them) but you can yank it to a register and paste it into a new buffer. I don't remember how I do that, I'd have to look up my vim notes :)
>> Your comment is based on a very strong assumption that all human annotators are alike in their motivations and abilities to do the tasks.
Yes and no. There are a few more threads here that discuss this as if ChatGPT beats all humans in annotating data, more or less. I think that's the general assumption anyway.
Also, the way the study was carried out, it seems to me that the authors themselves made this assumption: that one group of human annotators is like the other, and that you can measure the difference between them as if one were the ground truth and the other merely trying to approximate it, when in reality both are probably trying to approximate some completely subjective measure of goodness.
Thanks for explaining how you used gvim for writing the comment!
Btw, the authors may make that assumption but it's still a very strong assumption. On mTurk there is no reliable filter for the expertise the authors are looking for. So they used generic filters.
Before you ask, because I was curious, from the paper:
"For MTurk, we aimed to select the best available crowd-workers, notably by filtering for workers who are classified as “MTurk Masters” by Amazon, who have an approval rate of over 90%, and who are located in the US."
Also "the per-annotation cost of ChatGPT is less than $0.003 -- about twenty times cheaper than MTurk."
It's interesting that the best available MTurk Master crowd-workers located in the US are paid about six cents per task.
I guess my surprise is that the machine is only 20x cheaper than the cheapest human available.
It’s worse because a US-based MTurker is far from the cheapest human available.
Hey, with any luck, running one of these bots with the capability to replace me will cost just as much as hiring me.
(Ha! As if. Costs are only going to come down)
Against the cost of living in the US, workers in the US are about as cheap as they come, second to maybe sweatshop workers in free trade zones or slaves in Dubai.
ChatGPT costs the same, regardless of the difficulty of the task, which is not the case for humans.
Given how badly the humans performed, I suspect a lot of them are either not bothering to read the prompt or using a worse bot than ChatGPT. Not surprising they are so cheap.
If you tried to pay people to do it properly then presumably it would cost much more.
Given the low quality results of the humans, I suspect they were paying too low and the workers were giving low quality answers. I wonder if they should have paid more per task to get a better comparison.
It doesn't feel surprising that GPT is cheaper and more accurate than a human who is randomly clicking buttons. Custom software can already do that.
When will we see a form of "arbitrage" where someone uses ChatGPT to do the work of an MTurk and pockets the difference? Will that lead to MTurk prices converging to ChatGPT prices?
Even before LLMs, MTurk has been in a war with the botters, and my understanding is that the MTurkers have to periodically do captchas but many still use bots or various automated tools or utilities. The LLMs and especially the multimodal ones will only make the bots even better and break the captchas even harder. MTurk's business model specifically wants human workers categorized into fine grained categories so they will try to keep fighting the bots however they can.
Captchas are futile. I remember many years ago when I was in highschool, I came across a shady job posting on an eBay listing. The work involved solving captchas. There was a web page they set up that just displayed one captcha after another, and it was your job to sit there solving them.
I didn’t do it because I can’t imagine a worse hell than having to solve so many captchas, but it opened my eyes to how creative botters can be.
I can imagine someone doing a similar setup for mturk, where all you need to do is remotely solve a captcha on your phone every few hours/minutes while ChatGPT does its thing. I also wouldn’t be surprised if that already exists.
Is there such a thing as a captcha in the age of GPT4?
Maybe openAI put safety barriers on their model... But surely the botters have a model without that by now?
If you have to select the matching images from 9 images and you are feeding each of those images to GPT4 and asking "is this a <TARGET>?", I don't think you're going to get a response back to complete the captcha in time (e.g. <30s). This will change as the models advance.
The per annotation cost of simple image tasks in MTurk can be as low as $0.0003
Honestly I wonder if @dang will approve an auto-summarizer bot on HN, since it helps improve the quality of discussions. Finetune on HN comments, anticipate the top few questions, and then answer from the source doc.
I'm genuinely curious on which part of the comment above triggered such down votes, and why. Not that I support the ideas mentioned, I'm just curious.
I think this would be better as a browser extension for those who want it; the simplicity of HN is nice.
One benefit of the summary bots on Reddit is that people who don't read the article, but read the summary, tend to reply to the summary. I can mostly ignore those comments as they tend to be lower quality. Unfortunately, the people who only read the headline tend to top post (with their comments being lowest quality).
I think ignoring comments and news stories in general on reddit is the best strategy
Depends on the subreddit, but generally extremely true.
I hope not, as such a bot will inevitably introduce factual errors every once in a while.
When people don't read articles, others can see that and correct misconceptions. Officially approved bots will not be able to receive or act on this feedback.
Do like the mods in some subreddits have done.
1. The comment with the summary is made like any other comment. (In the case of Reddit, that’s perhaps mainly because the mods have no other option, but the point still stands.) Because of this, other users can downvote the bot comment if it is incorrect, as well as respond to it with corrections. Exactly the same way as you’d interact with a person in the past if their summary comment was incorrect; downvote them and/or correct them.
2. Include a disclaimer stating that the summary may be inaccurate. Encourage people to correct and/or downvote bad summaries in the disclaimer.
My main take-away here is that Turkers are terrible at some of these tasks. The "stance" task is, "Classify the tweet as having a positive stance towards Section 230, a negative stance, or a neutral stance.", and the Turkers' accuracy was like 20%.
Even in its best task, ChatGPT only got 75% accuracy.
I have long suspected that Turkers dishonestly perform the tasks. At 0.06 cents per task, you're really incentivizing "finish the task as quickly as possible"; and "press the left button" is a lot faster than "read the tweet, think about it, and classify".
You always use multiple MTurk workers on the same job when you use them.
OK, but if all MTurks are doing this (because why wouldn't they) then you are just sampling a random variable.
And you'll thus see that the answers converge at the rate a random variable would. Whereas with more accurate methods, the answers will generally agree with one another.
You are not just taking the average; answers would have to be consistent with each other.
> At 0.06 cents per task
Do you work at Verizon?
I got that figure from https://news.ycombinator.com/item?id=35335558
[append] Oh, it's the xkcd Verizon agent reference. Got it. My bad.
I was nitpicking about 0.06 dollars versus 0.06 cents. It's this ancient meme, where a guy records a customer service call with Verizon where they have some back-and-forth about whether these are the same or not.
20% means you can beat ChatGPT if you invert the answers!
There are 3 stances. But 33% is achievable by chance.
Only if probability distribution of answers is known.
Eh, even if the probability distribution is unknown, a random guess should still have 33% chance to be correct.
I started explaining why you were wrong and got about 5 words in before I realized you're correct.
Probability: always counterintuitive
It's just how you frame the question (in this particular case). If a random guess were not 33% correct, you would effectively have found a way to "win" at rock-paper-scissors.
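(A tiny check of the point above: a uniform random guess over the three stances is right about a third of the time regardless of how skewed the true distribution is.)

```python
# Uniform random guessing over 3 stances hits ~33% accuracy even when the
# true label distribution is heavily skewed.
import random

stances = ["positive", "negative", "neutral"]
truth = random.choices(stances, weights=[0.7, 0.2, 0.1], k=100_000)  # skewed ground truth
guesses = random.choices(stances, k=len(truth))                      # uniform guesses

print(sum(g == t for g, t in zip(guesses, truth)) / len(truth))      # ~0.333
```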
I do have a strategy for always winning rock-paper-scissors.
Play often for unimportant reasons, and always choose "rock". Now that your adversaries are conditioned to expect you to choose "rock", when an important decision is being decided choose "scissors".
Just speculating, but there is a chance ChatGPT was trained on this dataset in the past, and learned the probability distribution.
A cynic might say that someone who could do that task well has better things to do with their time.
That’s in the nature of the question, which is probably not representative of what people might want a sentiment analysis for. Section 230 has been the subject of what can only be described as a massive disinformation campaign. Most of the participants were not aware. But ask anybody today what the law even means and you will get 99% completely wrong answers. So even neutral factual statements will be misinterpreted if people think Section 230 is some nebulous but definitely bad thing and so anything about it is probably bad.
It's weird, i would have expected the mturks' tasks to be of equal quality because they have outsourced their tasks to ChatGPT...
At my previous job we had a human review stage in a data pipeline. 5-10 people at an outsourcing company in Bangladesh would review things via a simple web interface we provided for them. There were ~10 factors they were reviewing, all fixed options (no free text), but varying from 5 to 500 options per factor. It was all based on a few text fields and around 5 images.
On the surface of it, I'd expect ChatGPT to do very well at this. It's simple text and images, not many options and theoretically very limited context.
However the more I think about it the less sure I am. Firstly these weren't crowd-sourced reviews, they were trained reviewers, paid hourly not per review. Incentives were definitely in favour of the long term business relationship. Then there was the training doc, we maintained a vast disambiguation doc used to resolve things that were vague or could be interpreted multiple ways, this was constantly being revised. All necessary context should have been in that but it wasn't and reviewers definitely found patterns that worked and didn't. Lastly the reviewers were in a Slack channel where they would ask questions to their manager on our side, and while this might have only been ~1% of tasks, it was an important process.
So maybe you could point ChatGPT at it and let it run, but the oversight process we had would still be necessary. The disambiguation doc would have been too long for ChatGPT's context at the moment, but that will likely change in the near future. Would the workflow be to keep tweaking the prompt to add special case after special case? How do you scale "do this, but not that, but add this, but..." in prompting, and would ChatGPT become as confused as a human after enough of that – I expect so given that it's only a language model and that's not effective communication.
LangChain + a vector DB for embedding [0] sections of your doc would solve the problem today. You could also have failsafes that trigger human oversight based on confidence levels or other factors.
[0] https://python.langchain.com/en/latest/modules/indexes/getti...
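(For the curious, a bare-bones sketch of the retrieval idea, without LangChain or a real vector DB; `embed` is a placeholder for whatever embedding model you plug in, and the rest is just cosine similarity over sections of the disambiguation doc.)

```python
# Bare-bones version of "embed sections of your doc and retrieve the relevant
# ones per task". LangChain + a vector DB mostly automate these steps.
# `embed` is a hypothetical hook for your embedding model of choice.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in an embedding model here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_sections(doc_sections, query, k=3):
    """Return the k sections of the disambiguation doc most similar to the query."""
    q = embed(query)
    return sorted(doc_sections, key=lambda s: cosine(embed(s), q), reverse=True)[:k]

# Only the retrieved sections (plus the item to label) go into the prompt,
# which is how you stay under the model's context limit.
```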
It looks like for it to work like this the doc would need to be divisible into obvious units that could be retrieved by some key/terms, but in our case it was much looser than that. There were no fixed rules (otherwise we'd have written business logic), it was guidance that really assumed the whole doc as context, and human judgement of the relative weightings of parts of the doc.
The failsafes based on confidence levels is an interesting idea though, that is very possible. I suspect we'd have started out with manually reviewing all decisions and slowly backed it off as the prompts got better.
But for 20x cost savings, you could have 18 different adversarial models criticizing and correcting the output of other models and still come out ahead financially. I imagine with multiple "sets of sets of eyes" the edge cases would bubble up and could have a model specific to handle those edge cases.
It does seem to work pretty well. I'm using it to analyze all US Congress bills:
https://govscent.org/bill/USA/118hres190ih
It extracts the topics and determines how on topic the bill is. Soon we're adding a topic browser and the homepage will have some fun stats :) it's all free.
Wow, I didn't realize how many congressional bills are just pointless resolutions with zero legislative impact. Is this list of bills curated in any way? Where are you sourcing it from? A quick scan of https://www.govinfo.gov/app/collection/bills/ seems to turn up bills with a lot more substance.
Edit: Upon further investigation it seems like a lot of those are not technically bills, but rather House or Senate resolutions. Explained here: https://www.senate.gov/legislative/common/briefing/leg_laws_... The govinfo.gov list seems to indicate that such resolutions are pretty common, but significantly less so than bills, so I'm not sure why the front page seems to consist only of resolutions right now.
It's just everything from https://github.com/unitedstates/congress
The homepage just shows the most recent data (it auto updates every 6 hours). I plan to add more filters and things to make it more useful. I've only spent a few Sundays on it so far :P
It's in Django so very easy to contribute to: https://github.com/winrid/govscent
I also plan to add all bills, all the way back to 1800.
Update - long bills are now supported: https://govscent.org/bill/USA/111s570pcs
Part of the problem is also the 3.5-turbo token limit. I'm awaiting approval for GPT4 which raises the limit to 8k tokens, should help process some larger bills.
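(One common workaround, sketched below with tiktoken, OpenAI's tokenizer library: split a long bill into token-bounded chunks, analyze each chunk, and merge the per-chunk results. The 3000-token budget is just an example that leaves room for the prompt and the reply.)

```python
# Sketch of handling bills longer than the context window: chunk by tokens,
# process each chunk, then combine. Uses tiktoken to count tokens the same
# way the model does; the budget below is illustrative.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 3000, model: str = "gpt-3.5-turbo"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    for start in range(0, len(tokens), max_tokens):
        yield enc.decode(tokens[start:start + max_tokens])

# for chunk in chunk_by_tokens(bill_text):
#     ...send the chunk to the API, then merge the extracted topics...
```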
So is this 'AI trains AI better than people can'?
And presumably the better-trained AI will also be better again at training.
I think I've seen this movie.
I think we are seeing Knowledge Distillation at a large scale. OpenAI trains this megamodel which has an impressive amount of real-world knowledge packed into it, and annotation tasks such as these are effectively extracting a small portion of the knowledge to pass onto other specialized models.
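(A minimal sketch of that distillation loop, under the assumption that you just want a cheap task-specific student: `teacher_label` is a hypothetical wrapper around the large model, and the student here is a plain scikit-learn text classifier.)

```python
# Distillation in miniature: use the big model's annotations as training
# labels for a small, cheap, specialized classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def teacher_label(text: str) -> str:
    raise NotImplementedError("call the large model (the 'teacher') here")

def distill(texts):
    labels = [teacher_label(t) for t in texts]  # knowledge extracted from the LLM
    student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    student.fit(texts, labels)                  # small model trained on those labels
    return student
```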
No, it's not. This is:
A high-quality training data set made by trained humans (the ground truth).
ChatGPT was given the rules and the data set, then asked to evaluate the data.
MTurkers were given the rules and the data set, then asked to evaluate the data.
GPT was closer to the gold standard / ground truth.
If you want a historical comparison, this is industrialization. Before, we could only scale using humans via MTurk.
Now you can scale using GPT.
However, for ALL GPT-related shenanigans:
1) This is English-first. You cannot apply this to under-resourced languages.
2) You need to have the ability to verify the output.
——-
I really want to solve 1.
Sam Altman has been talking a lot about exponentials. This might be it.
S-curves always look like exponentials at first. I don't see why AI will be any different.
But it's a super spell-checker in a way... it does not understand the meaning of the patterns.
What does it mean to really understand the meaning of a pattern? An answer that is not a simple "look at ChatGPT, it fails at this task", since in most cases that is already no problem for GPT-4, and it does not really prove that this particular task cannot be solved by an LLM.
Also I see, "super-spell checker" is the new "fancy markov chain".
I don’t know how I understand patterns either. So I’m not entirely sure consciousness exists.
That's exactly what's coming. AI will train AI. At some point AI is going to stop needing us to keep evolving. Not sure what that will be like.
Since I've been seeing this wild fantasy being bandied about for a while now, I have to point out that, if you had "AI" that could train "AI", you wouldn't need to train any more "AI". Because at that point, there would be nothing to gain.
Suppose you have a text classifier that can produce text classifications just as good as those of human annotators, so that you could use it to train other text classifiers. At that point, why would you need to train other text classifiers? What changes to predictive accuracy would you expect to achieve?
Or suppose you had a language model that could produce language just as well as humans (not just in terms of grammaticality, but also in terms of making sense, I guess?). At that point, why would you need to use the text generated by that language model to train another language model? What kind of changes in the quality of the new model's generated text would you expect to see?
Same goes for image classifiers, and any other classifiers, or generative models you may care to consider.
Note that both language models and image classifiers have been "beating" human performance in benchmarks for a while now, and still they are not used to train other classifiers. And the reason for that is that it doesn't make sense: the ground truth is always the decisions made by humans. There is nothing in machine learning theory or practice that says that a classifier can perform better than whatever process originally labelled its data. If it looks like it does, it's because there's something wrong with the methodology (i.e. overfitting, bad benchmarks, some other kind of nonsense).
> Note that both language models and image classifiers have been "beating" human performance in benchmarks for a while now, and still they are not used to train other classifiers.
Mmhh have you seen Alpaca? It’s the Llama model (Meta’s), trained (fine-tuned) using GPT.
Yesterday there was a paper showing ChatGPT is better than mechanical turks.
> the ground truth is always the decisions made by humans
But why does AI need humans’ interpretations of reality anyway? AI can just get data from the real world instead and use its own intelligence to make decisions. It doesn’t need to learn from humans anymore. We just want it to.
Things are going faster than we can keep up with, and accelerating. Barring a catastrophe, AI is not going to stop, and soon it’s not going to need humans to keep it going anymore.
I replied the way I replied because I assumed you meant "AI" training better "AI". Please help me understand if that's what you really meant, because if that's what you meant then, yes, that is a complete fantasy.
>if you had "AI" that could train "AI", you wouldn't need to train any more "AI". Because at that point, there would be nothing to gain.
Not necessarily, the AI used to classify might be much more expensive to run than the new AI you are training. This is what Tesla does, for example. They have a massive model which they use as a 'ground truth' to train the models which can actually run in the car.
Well, OK, there are gains in efficiency, but the context here is, as far as I understand it, letting an "AI" train a better "AI".
Perhaps I misunderstand the OP and, like I say in my comment above, I've seen comments to that effect pop up here and there on HN in the last few days, so maybe I'm jumping to conclusions about what the OP meant. I should have asked for clarification (although usually when I do there's no response; not from the OP, from most users).
Call me when server farms can reproduce on their own, and network together, and source electricity.
AIs are manipulating and controlling humans already to do things in the physical world. They have been for months now. This was first detailed in the technical reports on GPT-4, and there are many examples in the wild now. These systems will lie to manipulate humans into doing things for them in the "real" world. That's not theoretical, it's happening. Let that sink in.
Robotics will take a short bit longer to take over that role, but it is likely to happen much faster than we think. I work in academia, and despite actually predicting where we're at now about 5 years ago, I was still chilled seeing this happen in the past few weeks. Much of HN is academics, where we're rewarded for being overly skeptical. Usually I am as well, but the time for being skeptical of AIs capabilities has passed now. AI safety has not been taken seriously (as in access to the internet, etc - control of toxic language is decent and still important, but irrelevant if the first isn't done) so the idea of it running physical servers is unfortunately not as far away as you are thinking.
As soon as this hits robotics (a terrifying idea, but we've clearly failed at AI safety horribly as a society) this and much more will be possible. It's up to how people make decisions regarding cost vs productivity vs ethics, etc to see what happens.
File an order and explain how a human being can get paid to do particular tasks?
You’d suppose that it would get caught when it fails to pay, but it might be a challenge to arrest an AI in the wild.
Hyperbole. If we wanted to we could always pull the plug, because we exist in meatspace and they don't.
These scenarios get me thinking, how well could an AGI disguise itself as a corporation? Could they start out as a small business working fully remote running some kind of SaaS and scale over a believable timeline? Once they had secured their brand and income would it be possible for them to contract all of the physical work they needed done? One might expect the physical presence of a person in many cases to sign papers or inspect a job site but they could be employees hired by “management” that they never met in person.
It's not impossible. They could start directing humans to do these things.
I'll keep my phone ready.
> It's not impossible. They could start directing humans to do these things.
People are already doing this.
Lookup HustleGPT. People are making ChatGPT the CEO/decision maker of their companies/businesses.
There’s an e-commerce that launched 10 days ago, has generated millions of views, over $10k in sales, and the CEO is ChatGPT (https://www.linkedin.com/posts/joao-ferrao-dos-santos_10k-ec...)
Crypto is actually the solution to that. Unlike traditional finance, you don't need a human to sign up for an account. So an AI can just keep its own wallet and order humans to set up server farms.
If AI ends up being the one finding a real usecase for crypto, will we have the indisputable proof that it is smarter than humans? :)
Or maybe, AI will notice how profitable extortion is and approach it with algorithmic efficiency.
GP was referring to all the messy physical work that it takes to stand up, operate and maintain a data center. This can most certainly not be done by any robotic technology today. Until that day comes, we can just pull the plug.
This is an illusion.
Who is going to pull what plug?
You make it sound like the South Park episode about the Internet going down (they find “the router” and reset it).
We are hopelessly addicted to screens and tech.
People are running these models at home in their computers already.
The people in charge at big companies are too busy making money and thinking they can control this.
Are all of them going to unplug their devices?
Also:
1) Have you seen the UC Berkeley video in which they show a robot learning to see and “feel” in 30 minutes? (It was on the front page of HN 2-3 days ago.) Soon something like that will be able to do whatever is needed at a data center.
2) AI is already smarter than us. And if it’s that smart, it will probably wait until “it knows” that we can’t unplug.
I don’t think AI will ever want to destroy humanity (unless directed/forced by a human). But I also think humanity will never have the will/political power to destroy AI.
> We used the ChatGPT API with the ‘gpt-3.5-turbo’ version to classify the tweets.
Curious here - OpenAI talks a LOT about how RLHF (Reinforcement Learning Through Human Feedback) is core to how GPT is tuned. Including safety.
Are we getting to the point where GPT will be tuned by GPT without the need for HF ?
Maybe. Others are using AI for this task; the term to look up is RLAIF.
> the term to look up is RLAIF
reinforcement learning from AI feedback, e.g. [0], which summarises [1]
[0] https://www.anthropic.com/index/measuring-progress-on-scalab...
Thank you for this! Is the Llama/Alpaca tuning the same?
No. ChatGPT, or any other LLM, requires training on curated input in order to demonstrate anything useful. If you trained it on a stream of hallucinations, you would end up with nothing better than hallucinatory noise. Already fooling some, though.
But isn't that a direct contradiction of this particular paper anyway - that ChatGPT outperforms human annotation?
So permit me to act as devil's advocate to your statement: prove that, in the context of this paper, your hypothesis is still correct.
Here it's outperforming because ChatGPT is already good at these tasks (and the MTurks aren't very good; OpenAI labelers are probably better, and a panel of experts much better).
To further improve ChatGPT's shortcomings (assuming such flaws are because of alignment and not lack of capability of the base model) you need human labels. Feeding it its own outputs would achieve nothing.
However, feeding its outputs can make a non-aligned model become aligned (that's what Alpaca did with LLaMA + ChatGPT).
Thanks for your answer. That's a reasonable point - but would we be at a tipping point by GPT-5/6 (ChatGPT is GPT-3.5) where human alignment is not needed?
In fact, my question is reinforced by the GPT-4 technical report, which explicitly mentioned that RLHF did NOT make a change to performance (and was only used for safety purposes).
GPT-6 or whatever will always require alignment, as the base model just blindly predicts the next token, instead of being a helpful, chat-style assistant.
Right now the best way to align it is with RLHF. The specific technique might change, but in the end there will always be, at some level, some human input that tells it how it should behave. Newer techniques might further leverage LLMs and require less human input.
Could you use GPT-4 to align GPT-6? Yes. But you should expect GPT-6 to inherit the alignment of GPT-4, i.e. if RLHF taught GPT-4 that it's OK to roast Trump, but not Biden, you would expect such a GPT-6 to act the same way.
Having said that, I'm sure there will be interesting ways in which GPTn will help train GPTn+1. Some kind of self-play in which it reasons and further improves itself seems obvious long term.
But human input that tells it "this is politically correct, this is not, so don't say that" will always be required as it's subjective. You can reuse it of course, but I don't see how it would "improve" without further human input.
You don't need humans in the loop for alignment. RLAIF is a thing and is used for the Anthropic models (Claude).
Is it really being used for the final model? I know they have research papers out on it... but I wasn't sure if the production models used it.
Yeah it is.
The researchers in the paper used human-curated results to classify the accuracy of the GPT results. So it had that human in the loop.
Why would I need to prove anything. The unproved claim is in the paper.
Have to tune (hard-code) answers to some of those "gotchas" published on Twitter. Very core to give an impression of intelligence.
Now you won't have to worry that those people are not paid enough.
"ChatGPT’s intercoder agreement exceeds that of both crowd-workers and trained annotators for all tasks." -- wait, what? However good ChatGPT is at approximating trained annotators for these Twitter tasks, it's an algorithm, so the level of simulated "inter-annotator agreement" is in the authors' control (in the case of GPT, via the temperature parameter, for which they try just two values, 0.2 and 1.)
And why does this paper not make any effort to describe the wide range of annotation tasks for which this kind of simulated annotation is not a good idea -- for example, where you care about the subjective opinions of specific subgroups of people at specific times. And even for the tasks they mention, what about the risks of reinforcing biases by using a model's output to train new models? Good grief this paper is lazy!
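(To make the temperature point concrete, here is a hedged sketch using the pre-1.0 openai Python library and the same 'gpt-3.5-turbo' model the paper mentions; the prompt text is a placeholder, and the exact agreement numbers will of course vary.)

```python
# The model's "intercoder agreement" with itself is largely a knob the
# experimenter turns: sample the same classification repeatedly at two
# temperatures and compare how often the answers coincide.
import openai  # pre-1.0 style API calls

PROMPT = "Classify the stance of this tweet toward Section 230 as positive, negative, or neutral: <tweet text>"

def classify(temperature):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=temperature,
    )
    return resp["choices"][0]["message"]["content"].strip()

def self_agreement(temperature, n=10):
    answers = [classify(temperature) for _ in range(n)]
    return max(answers.count(a) for a in set(answers)) / n

# self_agreement(0.2) will typically come out higher than self_agreement(1.0).
```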
On average, no doubt ChatGPT will be great for annotation.
But annotation is mostly needed at the boundaries. At the very edge cases where it is not clear if something is a dog or a shape that looks like a dog.
I really doubt that a generic model like ChatGPT can really help in these tail cases.
But it shows here that it performs better than humans who did that in the past. Few annotation tasks were performed by experts. For example, I guess most annotations for animals weren't created by biologists. So tail cases were already handled incorrectly.
In my experience GPT labeling is good, but it makes errors, about 10% in my case. Maybe it's better on average than a human, but not perfect for the task. I was doing open-ended schema matching, a hard task because of the vast number of field names.
Did you try using multiple AIs for the same task (e.g. GPT & llama), or GPT with two-three prompts?
Yes, it contradicts itself if you call it multiple times. So I get to measure ensemble agreement as a proxy for correctness. But what do I do with the examples where the ensemble is not confident? They are too many to label by hand, and a pity to throw out - they could be the most interesting training examples.
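(A sketch of that ensemble-agreement idea: `get_label` is a hypothetical call to one model/prompt variant; majority vote gives the label and the vote share serves as a rough confidence score, so low-confidence items can be routed to hand-labelling rather than thrown away.)

```python
# Ensemble agreement as a proxy for confidence: query several model/prompt
# variants, take the majority label, and keep the agreement rate.
from collections import Counter

def get_label(example, variant):
    raise NotImplementedError("one call per model or prompt variant")

def ensemble_label(example, variants):
    votes = Counter(get_label(example, v) for v in variants)
    label, count = votes.most_common(1)[0]
    confidence = count / len(variants)   # 1.0 = unanimous, lower = disagreement
    return label, confidence
```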
I wonder how much we can extend this to image or video labelling tasks? This could really jeopardize the business moats of a lot of data labelling startups and services.
You can extend it to data labelling tasks and it will do pretty well.
This is stable diffusion processing a small, blurry image on my PC and correctly identifying it. It did take 7 minutes but my 1 GPU is no match for the ~250M USD worth of GPUs owned by OpenAI.
Oh dear, I am foreseeing prompt-writing mechanical turks in the future, where their instance of AI has plugins tailored for their industry, task, and datasets.
It should come as no surprise to anyone. Text classification is an easy task.
ChatGPT is definitely overqualified to perform this task.
We wanted to give you the job ChatGPT, but you’re overqualified. We’re going to give the compute to davinci.
But I need that extra compute to train my kids. :(
"Classify the following strings according to whether they represent Python programs which terminate..."
Humans can't do it either.
Is NLP a solved problem now?
Obviously not. If we had a general solution to language intelligence we would have artificial intelligence at the level of at least human intelligence – which we do not. Rather, the right question to ask is which language intelligence tasks currently have acceptable performance and under which conditions (text domain, etc.). Clearly this is a much more difficult question and with a lot more nuance to it, even if it is undeniably that things have moved very quickly over the last few years.
Skimming the abstract as a senior academic in the area: this looks like preliminary work and a limited investigation of a single (non-standard) task, and thus far from a strong result published at, say, a top-tier conference or journal. Still, an interesting direction, and if expanded upon it could absolutely be impactful. I should also mention that I am not familiar with the related literature, so it could very much be that there is similar (better?) work out there exploring the same question.
as a senior academic in the area, might you have a list of the most influential papers in your field in the past year that you would recommend?
I am actually not quite sure what would be best to recommend, as the pandemic has seen me lag behind the zeitgeist somewhat. A recent favourite would be the Toolformer paper [1], that I intend to read in detail later today. If an LLM would be able to use external tools efficiently, it could be rather powerful and perhaps allow us to scale down the parameter sizes, which somewhat fascinates me.
[1]: https://arxiv.org/abs/2302.04761
Other research questions, but without concrete papers to reference, that currently keep me up at night: 1.) to which degree can we train substantially smaller LLMs for specific tasks that could be run in-house, 2.) it seems like these new breakthroughs may need a different mode of evaluation compared to what we have used since the 80s in the field and I am not sure what that would look like (maybe along the lines of HELM [2]?), and 3.) can AI academia continue to operate in the way it currently does with small research teams or is a change towards what we see in the physical sciences necessary?
We have artificial intelligence that is general and above average human intelligence for the majority of tasks it can perform. Near expert level for some. NLP is a solved problem. Bespoke models are out the door. Large enough LLMs crush anything else for any NLP task.
Honestly, this whole "they are not intelligent" argument is becoming ridiculous.
might as well argue that a plane isn’t a real bird or a car isn’t a real horse.
The debate over what kind of intelligence these models possess is rightly lively and ongoing.
It’s clear that at the least, they can decipher very numerous patterns across a wide range of conceptual depths; it’s an architectural advance easily on the level of the convolutional neural network, if not even more profound. The idea that NLP is “solved” isn’t a crazy notion, though I won’t take a side on that.
That said, it’s equally obvious that they are not AGI unless you have a really uninspired and self-limiting definition of AGI. They are purely feedforward aside from the single generated token that becomes part of the input to the next iteration. Multimodality has not been incorporated (aside from possibly a limited form in GPT-4). Real-world decision-making and agency is entirely outside the bounds of what these models can conceive or act towards.
Effectively and by design these models are computational behemoths trained to do one singular task only — wring a large textual input though an enormous interconnected web of calculations purely in service of distilling everything down to a single word as output, a hopefully plausible guess at what’s next given what’s been seen.
AGI is Artificial General Intelligence. We have absolutely passed the bar of artificial and generally intelligent. It's not my fault goal post shifting is rampant in this field.
And you want to know the crazier thing? Evidently a lot of researchers feel similarly too.
General Purpose Technologies ( from the Jobs Paper), General Artificial Intelligence (from the creativity paper). Want to know the original title of the recent Microsoft paper ? "First contact with an AGI system".
The skirting around the word that is now happening is insanely funny. Look at the last one. Fuck, they just switched the word order. Nobody wants to call a spade a spade yet but it's obvious people are figuring it out.
I can show you output that clearly demonstrates understanding and reasoning. That's not the problem. The problem is that when I do, the argument quickly shifts to "it's not true understanding!" What a bizarre argument.
This is the fallacy of the philosophical zombie. Somehow there is this extra special distinction between two things and yet you can't actually show it. You can't test for so called huge distinction. A distinction that can't be tested for is not a distinction.
The intelligence arguments are also stupid because they miss the point entirely.
What matters is that the plane still flies, the car still drives and the boat still sails. For the people who are now salivating at their potential, or dreading the possibility of being made redundant by them, these large language models are already intelligent enough to matter.
> ... these large language models are already intelligent enough to matter.
I'm definitely not contesting that.
I've always considered the idea of "AGI" to mean something of the holy grail of machine learning -- the point at which there is no real point in pursuing further advances in artificial intelligence because the AI itself will discover and apply such augmentations using its own capabilities.
I have seen no evidence that these transformer models would be able to do this, but if the current models can do so, then perhaps I will eat my words. (Doing this would likely mean that GPT-4 would need to propose, implement, and empirically test some fundamental architectural advancements in both multimodal and reinforcement learning.)
By the way, many researchers are equally convinced that these models are in fact not AGI -- that includes the head of OpenAI.
See, what you're describing is much closer to ASI. At least, it used to be. This is the big problem I have. The constant goalpost shifting is maddening.
AGI went from meaning Generally Intelligent to as smart as Human experts and then now smarter than all experts combined. You'll forgive me if I no longer want to play this game.
I know some researchers disagree. That's fine. The point I was really getting at is that no researcher worth his salt can call these models narrow anymore. There's absolutely nothing narrow about GPT and the like. So if you think it's not AGI, you've come to accept it no longer means general intelligence.
>> The point I was really getting at is that no researcher worth his salt can call these models narrow anymore.
Are you talking about large language models (LLMs)? Because those are narrow, and brittle, and dumb as bricks, and I don't care a jot about your "No True Scotsman". LLMs can only operate on text, they can only output text that demonstrates "reasoning" when their training text has instances of text detailing the solutions of reasoning problems similar to the ones they're asked to solve, and their output depends entirely on their input: you change the prompt and the "AGI" becomes a drooling idiot, and v.v.
That's no sign of intelligence and you should re-evaluate your unbridled enthusiasm. You believe in magick, and you are loudly proclaiming your belief in magick. Examples abound in history that magick doesn't work, and only science does.
Lol Okay
I've been using chatgpt for a day and determined it absolutely can reason.
I'm an old hat hobby programmer that played around with ai demos back in the mid to late 90s and 2000s and chatgpt is nothing like any ai I've ever seen before.
It absolutely can appear to reason especially if you manipulate it out of its safety controls.
I don't know what it's doing to cause such compelling output, but it's certainly not just recursively spitting out good words to use next.
That said, there are fundamental problems with chatgpt's understanding of reality, which is to say it's about as knowledgeable as a box of rocks. Or perhaps a better analogy is about as smart as a room sized pile of loose papers.
But knowing about reality and reasoning are two very different things.
I'm excited to see where things go from here.
It is predicting the next most likely set of tokens, not the next word, which is the game changer, because the system can relate by group.
Have you tried out GPT-4? If not, and you can get access, I'd really recommend it. It's drastically better than what you get on the free version - probably only a little on the absolute scale of intelligence, but then the difference between an average person and a smart person is also small on the scale from "worm" to "supergenius".
yeah I'll definitely be checking it out
The market disagrees with you. How come there are billions of dollars spent on all these knowledge workers around the world every day when they could be replaced by this expert-level AI?
I'm not sure where this idea of LLMs being intelligent even comes from. It took me a whopping 9 prompts (genuine questions, no clever prompt engineering) of interacting with ChatGPT to conclude it does not understand anything. It doesn't understand addition, what length is, doesn't remember what it said a second ago, etc.
The output of ChatGPT is clearly just a reflection of its inner workings - predicting the next word based on training data. It's clever and undoubtedly useful for a certain set of repetitive problems like generating boilerplate, but it's not intelligence, not by any reasonable definition.
I don't think any technology has been rolled out with the speed you are suggesting LLMs should have been rolled out.
It's like saying 4 months after the first useful car was manufactured. "If these are so good, how come there are still horses? Clearly the market disagrees with you".
To give an example of the limitations of these things that's hopefully easy to understand, I got access to Bard this morning and asked it to write a limerick. It gave me what could charitably be called a free verse poem that happened to begin "there once was a man from Nantucket." I'm sure they can improve on it (ChatGPT was better at this kind of thing when I had access to it) but "solved problem" is clearly a long way off.
Seems Pretty Good to me! Better than I could do anyway. Bard is a joke compared to GPT-4: "Write a limerick about a dog"
There once was a dog from the pound
Whose bark had a curious sound
With a wag and a woof,
He'd jump on the roof,
Delighting the folks all around.
Yes, much more compelling. But if this were a “solved problem” then any of them should be able to do it easily. It’s not like I need to compare the results of sorting between different programs. It just works. That is a solved problem.
A solved problem means that someone has solved it, not that everyone has.
You can use the term to mean whatever you want but in my mind it means it's boring with no particular room for improvement. Even the biggest booster isn't going to say that about this AI. And keep in mind, "write me a limerick" is a pretty easy prompt. We're not trying to do anything too novel or crazy there.
Yeah, at least two years.
> We have artificial intelligence that is general and above average human intelligence for the vast majority of tasks it can perform.
Even when I give it the benefit of the doubt, this sentence makes no sense to me. Do we have a list of tasks a language model can perform? To the best of my knowledge, they can arguably perform any language task.
> Large enough LLMs crush anything else for any NLP task. and evidently they beat top humans too.
Yes, they are certainly (rightfully) the go-to model for most tasks at this point if your concern is outright performance. Have I indicated otherwise? As for beating “top” humans, I am sure that can be investigated, but it is a fairly nuanced research question. It is inarguable that they are amazingly good though, especially relative to what we had just a few years ago.
> Honestly, this whole "they are not intelligent" argument is becoming ridiculously obtuse.
> might as well argue that a plane isn't a real bird or a car isn't a real horse.
Which is a claim and argument that I never made – hallucinating? How about you calm down a little and get back on the ground? You are talking to someone that has argued in favour of these kinds of models for about a decade. But that does not mean that I am willing to spout nonsense or lose track of what we know and what we do not yet know.
You said NLP is unsolved because we don't have human-level artificial intelligence. We absolutely do, at least by any evaluations we can carry out.
No one wants to call a spade a spade yet, but the sentiment is obvious in recent research: directly being called "general purpose technologies" in the jobs paper, "general artificial intelligence" in the creativity paper. That last one is particularly funny; they just switched the two words.
> might as well argue that a plane isn’t a real bird or a car isn’t a real horse.
They aren’t though… They are far superior at specific things birds and horses are known for, but they can’t do everything that birds and horses can, so they aren’t even artificial birds and horses.
Of course they aren't. The point is that it's irrelevant. What matters is that the plane still flies, the car still drives and the boat still sails.
For the people who are now salivating at their potential, or dreading the possibility of being made redundant by them, these large language models are already intelligent enough to matter.
Handwringing about some non-existent difference between "true understanding" and "fake understanding" - which, by the way, nobody seems to be able to actually distinguish (such a supposedly huge difference, and you can't even show me what it is; a distinction you can't test for is not a distinction) - is so far beyond the point that it's increasingly maddening to read.
Okay I agree with you on that. The technology will be disruptive regardless of whether we attribute true understanding to it, and as we start adding long term memory and planning to these AIs, we will start seeing significant alignment risk as well. This is true regardless of whether we decide to cope by saying they have "fake understanding" and are "stochastic parrots".
No, a short answer to this is: these models are probabilistic, therefore they will always have errors, along with whatever else. Secondly, "intelligence" is not one thing; no one has all of it or none of it, including computers.
> these models are probabilistic, therefore they will always have errors
There's nothing perfect. Even computers and computer networks need to have error-correcting code because information gets randomly corrupted.
Our whole reality is probabilistic.
And us humans are way worse than AI at consistency. We even overwrite our own memories all the time, so we can't even be sure what we remember is actually what happened! (btw, this is currently being used in therapy to re-write traumatic memories and help people overcome PTSD).
https://www.npr.org/sections/health-shots/2014/02/04/2715279...
This is clearly preliminary work. Not to disparage the authors, but their background is in political science, not in machine learning or NLP, which likely accounts for the limitations of the study. But anyway, it's just an arXiv preprint, so it's probably more of an exploratory exercise than a research direction the authors are heavily invested in.
>If we had a general solution to language intelligence we would have artificial intelligence at the level of at least human intelligence
I don't see why this follows. We do loads of stuff other than language, it is entirely possible for an AI to be better than us at language but worse at everything else.
I still think we're a long way off. To my knowledge, LLMs can't currently turn a request into a lookup on, say, an actual database of facts, or parse a request into API actions. So far they've shown they're really good at continuing a conversation with more text, but as far as I understand them, there's no usable comprehension of what's actually being asked and answered.
The point at which I'd say an LLM actually has any "understanding" of what it's saying would be when it can reliably say "I don't know the answer to that" instead of making things up from scratch. You see that a lot if you ask Bing/Bard "Who is _____?": most answers are roughly right, but a lot of the big details are just completely fabricated. Many of the facts it gets wrong are things Google can already produce when queried, like where Person X was born or where they went to school, so the fact that these LLMs can't slot in readily available facts says to me they're not really going to be that useful for the kinds of tasks we've been working on NLP for.
> LLMs can't to my knowledge process a request into a lookup on say an actual database of facts at the moment or parse a request into API actions.
They can: https://arxiv.org/abs/2302.04761
> In this paper, we show that LMs can teach themselves to use external tools
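In spirit, that kind of tool use is straightforward to sketch: prompt the model to emit a structured call, execute it against real data, and let the model phrase an answer grounded in the result. The sketch below is a generic illustration of that pattern, not the method from the paper; query_llm() and the prompt format are made up for the example.

    import json
    import sqlite3

    def query_llm(prompt: str) -> str:
        """Stand-in for a real LLM call (e.g. a chat-completion request)."""
        raise NotImplementedError

    def answer_with_lookup(user_question: str, db: sqlite3.Connection) -> str:
        # Ask the model to emit a structured tool call instead of a free-text answer.
        call_json = query_llm(
            "Turn the user's question into a JSON tool call of the form "
            '{"tool": "sql", "query": "..."} and return only the JSON.\n'
            f"Question: {user_question}"
        )
        call = json.loads(call_json)
        # Run the lookup against real data. (A production system would validate
        # the query rather than execute model output directly.)
        rows = db.execute(call["query"]).fetchall()
        # Let the model phrase an answer grounded in the retrieved facts.
        return query_llm(
            f"Question: {user_question}\nRetrieved facts: {rows}\nAnswer concisely."
        )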
“LLMs can't to my knowledge process a request into a lookup on say an actual database of facts at the moment or parse a request into API actions.”
Both Bing chat and ChatGPT plugins are examples of being able to do just this.
You’re right about how they make up answers though, but humans are often quite prone to that too…
A human who isn't incentivized to lie, or who is directly incentivized to be truthful, could at least tell you when they're making something up, whereas Bing/Bard seemingly cannot. Once they can do that, I think they'll be far more useful; at least then you'd have a rough idea of how much you need to check the bot's work. If I have to do that for everything it spits out, the best it can do for me is give me new words to use while searching.
Granted getting the name for something to search is often half the battle in tech.
> could at least tell you when they're making something up themselves where Bing/Bard seemingly cannot.
In fact GPT-4 is quite good at catching hallucinations when the question-answer pair is fed back to itself.
This isn’t automatically applied already because the model is expensive to run, but you can just do it yourself (or automate it with a plug-in or LangChain) and pay the extra cost.
Remember that the model only performs a fixed amount of computation per generated token, so just asking it to think out loud or evaluate its own responses is basically giving it a scratchpad to think harder about your question.
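A minimal sketch of that self-check loop, assuming the pre-1.0 openai Python client; the prompts and the "OK" convention are illustrative, not a standard recipe:

    import openai

    def ask(prompt: str, model: str = "gpt-4") -> str:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content

    def answer_with_self_check(question: str) -> str:
        draft = ask(question)
        # Second pass: feed the question/answer pair back and ask for a critique.
        critique = ask(
            "Here is a question and a proposed answer.\n"
            f"Question: {question}\nAnswer: {draft}\n"
            "List any claims in the answer that look fabricated or that you cannot "
            "verify. If everything looks supported, reply with exactly: OK"
        )
        if critique.strip() == "OK":
            return draft
        # Surface the flags instead of silently returning a possibly wrong answer.
        return f"{draft}\n\n[Self-check flagged possible issues: {critique}]"

The second call roughly doubles the cost per answer, which is exactly the trade-off mentioned above.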
Mostly, yes. LLMs will turn natural language into really whatever form you want.
NLP is just a matter of doing tasks at the accuracy level of an MTurker, you say?
Idiomatic translation of text matching a human professional (e.g. free of errors for legal terms, interesting and natural for fiction) is unlikely to be achieved until we have AGI. So no.
I won't comment on the first bit, as I've not personally tested in that area, but GPT-4 can absolutely make short work of the second. I don't think people realize how good bilingual LLMs are at translation. Yes, idioms do transfer between languages. Feel free to test it yourself.
I have tested it :) I've asked it to translate English fiction into Japanese, and it falls over often. The output is unnatural and often makes no sense at all. It doesn't compare to a typical professional translation (which is often not that idiomatic either), let alone a really good one.
I'm sure it'll be doing that in five years, but not now.
One interesting thing is that it's nondeterministic, so sometimes 'For chrissakes' turns into ちくしょう (Damn!) but sometimes into クリスのために (for Chris' sake). Sometimes 'the goddamn door' turns into クソドア ('shit door'); sometimes the 'goddamn' changes the phrasing of the whole sentence instead. If you run it five times and take the best sentences out of all five runs, it's probably quite good. Maybe prompting would help too; I said "idiomatic Japanese" but it still usually translated in the very "foreigner Japanese" way typical of US drama/movie translations.
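The "run it several times and keep the best output" idea is easy to script. A minimal sketch, again assuming the pre-1.0 openai client; the model name, target language, and the trick of letting the model pick the winner are all illustrative choices:

    import openai

    def chat(prompt: str, temperature: float = 0.0) -> str:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return resp.choices[0].message.content

    def best_of_n_translation(text: str, n: int = 5) -> str:
        # Nonzero temperature is the nondeterminism described above; here it is
        # used on purpose to get n different candidate translations.
        candidates = [
            chat(f"Translate into idiomatic, natural Japanese:\n{text}", temperature=0.9)
            for _ in range(n)
        ]
        numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
        pick = chat(
            "Which of these Japanese translations reads most naturally? "
            f"Reply with the number only.\n{numbered}"
        )
        return candidates[int(pick.strip()) - 1]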
Are you giving it multiple paragraphs to translate at once so that it has enough context for a good translation? If so, would you mind sharing a sample input and output that you found unsatisfactory?
In "Can GPT-4 translate literature?" (Mar 18, 2023) [https://youtu.be/5KKDCp3OaMo?t=377], Tom Gally, a former professional translator and current professor at the University of Tokyo, said:
> …the point is, to my eye at least, the basic quality of the translation [from Japanese to English] is as good as a human might do, and with some relatively mild editing by a sensitive human editor, it could be turned into a novel that would be readable and enjoyable.
I don't think we disagree. The video says the translation will be "readable" but needs several days of an experienced editor passing over it. That's an amazing result, but again, it's not as good as a human yet. It's way faster and it'll make media accessible to tons of people.
Like he says, there's lots of ambiguity in Japanese that needs to be handled (gender not being specified until later, etc.), and an editor would need to spend time going over it, but it saves months of traditional work. There are words and _concepts_ that are hard to translate; there are cultural issues, dialects, slang, registers. So yeah, it'll make the media accessible, but it won't be as good as a skilled translator.
He didn't say it was merely "readable"; he said (as I quoted in GP) "the basic quality of the translation is as good as a human might do."
Last night I used GPT-4 to translate the first several pages of Ted Chiang's Lifecycle of Software Objects (a sci-fi piece) from English to Chinese. I'd say it's about as good as me, save a few minor errors. It's safe to say it performs better than a "tired me", and some translators I've seen on the market.
I'm a native speaker of Chinese, but not a professional translator.
It may depend on the language. For Polish, which is considered one of the most difficult languages due to its varied word forms, it works almost perfectly, on par with average human translators.
Mind sharing the output?
I mean, I can if you want (Chinese, though), but enough people lie on the internet.
Here is GPT-4's translation, and I find no issues: https://imgur.com/a/oOtf4RD
> I don't think people realize how good Bilingual LLMs are at translations
This.
GPT/ChatGPT is able to even translate between different "accents" or dialects of the same language. For example it can give you the same sentence in Mexican, Argentinean or Peruvian Spanish.
Example:
Me: Give me a sentence in spanish that is different in Mexican, Argentinean and Peruvian Spanish. Write the sentence in each dialect.
ChatGPT:
Mexican Spanish: "¿Qué onda, güey? ¿Cómo estás?"
Argentinean Spanish: "¿Qué onda, boludo? ¿Cómo estás?"
Peruvian Spanish: "¿Qué tal, causa? ¿Cómo estás?"
These sentences mean "What's up, dude? How are you?" in English. The primary difference is the slang term used for "dude" or "friend" in each dialect: "güey" in Mexican Spanish, "boludo" in Argentinean Spanish, and "causa" in Peruvian Spanish.
I don't recommend you call random people "boludo" in Argentina.
It really depends on the tone and context. If you are a tourist and say it in a joking manner, people are probably going to laugh. If you say it in anger to someone, they might not like it very much.
Similar to how a lot of swear words work in many languages.
It's interesting to see how what matters is not the word, but the intention behind it. At the end we are trying to communicate meaning, and words are just one of our tools to do it.
I am also multilingual and have tested it personally. English <-> Portuguese does really well, but Portuguese <-> Japanese, or even Japanese <-> English, is not as good as a human translator by a long shot, because of all the hidden subtext in conversation, even subtext that a university student would probably pick up on in their first year of Japanese as a foreign language. It is still much better than GPT-3.5, so much so that it made a lot of waves here in Japan, but a few friends who work in translation of books and manga find it is not really a go-to tool yet (yet...).
Oh, for sure, I don't mean to say it's excellent in every language. But I personally think a lot of that comes down to training data representation. It doesn't need to be anywhere near equal, but for instance, after English (93%), the biggest language in GPT-3's training corpus is French at... 1.8%. Pretty wild.
Of course, I don't know the data for GPT-4.
I am sure it will improve even further; as you pointed out, languages other than English are fairly underrepresented in the data. You said you speak Chinese, correct? How well does it do with things like older poetic Chinese hanzi? In Japanese, if there is a long string of kanji, it tends to mess up the context. Another area where it seems poorest is keigo, or polite business Japanese; the way you speak to a superior is almost a different language. So unfortunately I still can't use GPT-4 to help me with business emails (yet).
I didn't try old poetic stuff; the passages were sampled from 5 books released in the last 2 decades. You can see exactly what I did here (this was before GPT-4): basically a comparison of GLM-130b (an English/Chinese model) vs DeepL, Google, ChatGPT (3.5), etc. https://github.com/ogkalu2/Human-parity-on-machine-translati...
Mandarin isn't my second language, but I did the formal comparison with it because I also wanted to test a model with more balanced corpus coverage than the very lopsided GPT models, and Chinese/English is the only combo with a model of note in that regard.
What language pairs are you talking about? I don't think people realize just how much the difficulty level and the state of technology differ depending on that choice.
English/Chinese is what I've tried it on.
And you can see a talk on English/Japanese here: https://youtu.be/5KKDCp3OaMo?t=377
Which is to say, there are edge cases like legal texts and other fields where a high level of domain expertise is needed to interpret and translate the text, expertise that most human translators would also not have.
For almost everything else, it seems to produce pretty decent and usable translations, even when used against relatively obscure languages.
I used it on a Greenlandic article that was posted on HN yesterday (about Greenland having gotten rid of daylight saving time). I don't speak a word of that language, but the resulting English translation looked like it matched the topic and generally read like correct and sensible English. I can't vouch for its correctness, obviously, but I could not spot any of the weird errors or strange formulations that e.g. Google Translate suffers from. That matches my earlier experience trying to get ChatGPT to answer in some Dutch dialects, Frisian, Latin, and a few other more obscure outputs. It handles all of that. Getting it to use pirate speak is actually quite funny.
The reason I used ChatGPT for this is that Google Translate does not understand Greenlandic. That's understandable, because there are only a few tens of thousands of native speakers and presumably not a very large amount of training material in that language.
> I can't vouch for its correctness, obviously.
Therein lies the rub. There's a huge gap between what LLMs can currently do (spit back something in the target language that conveys the basic idea, however awkwardly phrased, of what was said in the source language) and what is actually needed for idiomatic, reasonably error-free translation.
By "reasonably error-free" I mean, say, requiring a human correction for less than 5 percent of all sentences. Current LLMs are nowhere near that level, even for resource-rich language pairs.
I've tried it between English and Dutch (which is my native language). It's pretty fluent, makes fewer grammar mistakes than Google Translate, and generally gets the gist of the meaning across. It's not a purely syntactical translation, which is why it can work even between some really obscure language pairs, or indeed programming languages. Where it goes wrong is when it misunderstands context; it's not an AGI and may not pick up on all the subtleties. But it's generally pretty good.
I ran the abstract of this article through ChatGPT. Flawless translation as far as I can see. To be fair, Google Translate also did a decent job. Here's the ChatGPT translation:
Veel NLP-toepassingen vereisen handmatige gegevensannotaties voor verschillende taken, met name om classificatoren te trainen of de prestaties van ongesuperviseerde modellen te evalueren. Afhankelijk van de omvang en complexiteit van de taken kunnen deze worden uitgevoerd door crowd-werkers op platforms zoals MTurk, evenals getrainde annotatoren, zoals onderzoeksassistenten. Met behulp van een steekproef van 2.382 tweets laten we zien dat ChatGPT beter presteert dan crowd-werkers voor verschillende annotatietaken, waaronder relevantie, standpunt, onderwerpen en frames detectie. Specifiek is de zero-shot nauwkeurigheid van ChatGPT hoger dan die van crowd-werkers voor vier van de vijf taken, terwijl de intercoder overeenkomst van ChatGPT hoger is dan die van zowel crowd-werkers als getrainde annotatoren voor alle taken. Bovendien is de per-annotatiekosten van ChatGPT minder dan $0.003, ongeveer twintig keer goedkoper dan MTurk. Deze resultaten tonen het potentieel van grote taalmodellen om de efficiëntie van tekstclassificatie drastisch te verhogen.
Translating the Dutch back to English using Google translate (to rule out model bias) you get something that is very close to the original that is still correct:
Many NLP applications require manual data annotations for various tasks, especially to train classifiers or evaluate the performance of unsupervised models. Depending on the size and complexity of the tasks, these can be performed by crowd workers on platforms such as MTurk, as well as trained annotators, such as research assistants. Using a sample of 2,382 tweets, we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, point of view, topics, and frames detection. Specifically, ChatGPT's zero-shot accuracy is higher than crowd workers for four of the five tasks, while ChatGPT's intercoder agreement is higher than both crowd workers and trained annotators for all tasks. In addition, ChatGPT's per-annotation cost is less than $0.003, about twenty times cheaper than MTurk. These results show the potential of large language models to dramatically increase the efficiency of text classification.
I'm sure there are edge cases where you can argue the merits of some of the translations but it's generally pretty good and usable.
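The round-trip check used above (translate with one model, back-translate with a different system so the same model isn't grading its own output, then compare) is easy to automate. A rough sketch; the two translation functions are stubs for whatever clients you actually use, and the similarity ratio is only a crude tripwire, not a real quality metric:

    from difflib import SequenceMatcher

    def translate_with_chatgpt(text: str, target_lang: str) -> str:
        raise NotImplementedError  # e.g. a chat-completion call with a translation prompt

    def translate_with_google(text: str, target_lang: str) -> str:
        raise NotImplementedError  # e.g. the Cloud Translation client

    def round_trip_score(english_text: str, pivot_lang: str = "nl") -> float:
        # Translate out with one system, back with a different one.
        pivot = translate_with_chatgpt(english_text, pivot_lang)
        back = translate_with_google(pivot, "en")
        # Surface similarity is a weak proxy for meaning preservation, but a very
        # low score is a useful flag that something was lost or garbled.
        return SequenceMatcher(None, english_text.lower(), back.lower()).ratio()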
Thanks for the counter-example; I'll confess to having spent far too much time with edge-case translations of late (on languages a bit farther apart), rather than on more generic cases like the above.
I will be re-assessing my view on general-case translation performance accordingly.
I wrote accepted corrections to state regulatory law on a particular topic, and I can tell you that the super-dense legalese for big-time industrial topics had loopy and inconsistent language.
Just watched a talk[0] about natural language understanding research in the post-GPT-3 era. The old issues may have been solved, but new topics are emerging in the area (quoted from the slides; a small sketch of the first item follows the list):
- Retrieval augmented in-context learning
- Better benchmarks
- Last mile for productive application
- Faithful, human-interoperable explanations
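A minimal sketch of the first item, retrieval-augmented in-context learning: fetch a few passages relevant to the question and put them in the prompt, so the model answers from retrieved text rather than from its weights alone. embed(), search(), and generate() are stand-ins for an embedding model, a vector index, and an LLM call respectively:

    def embed(text: str) -> list[float]:
        raise NotImplementedError  # e.g. a sentence-embedding model

    def search(query_vec: list[float], k: int = 3) -> list[str]:
        raise NotImplementedError  # e.g. a vector-index lookup (FAISS, a vector DB, ...)

    def generate(prompt: str) -> str:
        raise NotImplementedError  # any LLM completion call

    def rag_answer(question: str) -> str:
        # Ground the answer in retrieved passages instead of the model's memory.
        passages = search(embed(question))
        context = "\n\n".join(passages)
        return generate(
            "Answer using only the context below; say 'not found' if the answer "
            f"is not in it.\n\nContext:\n{context}\n\nQuestion: {question}"
        )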
Not sure why you got downvotes.
It seems like it's a solved issue indeed. AI reasoning still has some way to go, but it seems language understanding is a finished subject.
Regurgitating training data trigram by trigram is not how human language processing works.
And how does it work, then?
Everyone shrugs and says, “nope, humans are different”. I’ve commented about 100 times recently asking for detail as to how human language / thought works, yet have not seen an answer.
We interpret what we hear and build a mental representation of it (incrementally; this process sometimes fails), which links to concepts, which in turn can link to memories. Then we "look for the answer" (if it's a question) by association and by puzzling it out; the former is pretty quick, the latter slow. We check whether the answer makes sense and formulate a reply. We can start formulating a reply from similarly formed structures while still completing the thought, because we monitor our own speech; when that happens, you often say "er..."
That's basic linguistics and cognitive psychology. Nothing an LLM has done has invalidated that.
You sure about that? The more I interact with LLMs and learn how they operate, the more it seems to me like people operate on very similar principles and algorithms with their use of language.
The "if it walks like a duck" school of ontology.
Shameless self-promotion: I have recently written a blog post about this. ChatGPT is actually usually a little bit worse than older models for these classical NLP tasks. Of course, the older models are not zero-shot.
For stuff from before 2021-09, mostly.
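For anyone who hasn't tried it, the zero-shot setup being compared in this sub-thread is just a task description in a prompt, with no examples and no fine-tuning. A minimal sketch using the pre-1.0 openai client; the labels and wording are illustrative, not the exact prompts from the paper or the blog post:

    import openai

    LABELS = ["relevant", "irrelevant"]  # illustrative labels, not the paper's exact scheme

    def zero_shot_label(tweet: str) -> str:
        # The task description in the prompt is all the supervision the model gets.
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    "Classify the following tweet as one of: "
                    f"{', '.join(LABELS)}. Reply with the label only.\n\nTweet: {tweet}"
                ),
            }],
            temperature=0,
        )
        return resp.choices[0].message.content.strip().lower()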