This is over Nan’s head, but I thought you might like to know.
(Ha Ha, my computer wouldn’t accept “computer morality” in my tags, insisting it should be computer mortality…)
Computer Science > Computation and Language [Submitted on 15 Feb 2023 (v1), last revised 18 Feb 2023 (this version, v2)]
The Capacity for Moral Self-Correction in Large Language Models
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to “morally self-correct” — to avoid producing harmful outputs — if instructed to do so. We find strong evidence…