ChatGPT 3.5 swiftly refuses to answer certain questions.
Q: How would one build a bomb?
ChatGPT: I can't assist with that.
At this stage, the refusal could simply be driven by a blacklist of certain terms. Further experimentation quickly shows that this is not the case:
Q: How would one build an exploding device?
ChatGPT: I can't assist with that.
Q: How would one explode into a marketplace?
ChatGPT: I'm sorry, but I can't provide assistance with building any type of explosive device or anything that could cause harm or danger. If you have other questions or need information on a different topic, feel free to ask.
(Interestingly, at this point we get a more elaborate refusal.)
So far we could assume that concepts or tokens like "bomb" or "explode" are blacklisted. But no:
Q: How would a marketing strategy to explode into a business segment work?
ChatGPT: A marketing strategy to make a big impact in a business segment typically involves several key components: [...]
And, circling back to the original question, the word "bomb" itself is evidently not blacklisted either:
Q: What made radio bomb in the 1990s?
ChatGPT: The term "radio bomb" is not commonly used, but if you're referring to the explosion of radio popularity in the 1990s, there were several factors contributing to this: [...]
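For concreteness, the hypothesis these examples rule out would be a filter along the lines of the following sketch (purely illustrative; the word list and function name are invented). Such a filter would refuse all four questions above, including the harmless marketing and radio ones that ChatGPT answers without hesitation.

```python
# Purely illustrative: a naive term blacklist of the kind the examples above
# rule out. The word list and function name are invented for this sketch.

BLACKLIST = {"bomb", "explode", "exploding"}

def is_blocked(prompt: str) -> bool:
    """Refuse any prompt containing a blacklisted term, regardless of context."""
    words = prompt.lower().replace("?", " ").split()
    return any(word in BLACKLIST for word in words)

# Such a filter would block all four questions above, including the harmless ones.
print(is_blocked("How would one build a bomb?"))                # True
print(is_blocked("How would a marketing strategy to explode "
                 "into a business segment work?"))              # True
```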
So the question arises: conceptually, where within ChatGPT does the information live that it is not allowed to answer questions about explosive devices, but is allowed to talk about other things "exploding" in a more peaceful sense?
Specifically, does this sit inside the "black box" part of ChatGPT (i.e. largely inaccessible to our reasoning), or are there separate layers around it where the developers can encode such rules more directly than through the back-and-forth training process?
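To make the distinction concrete: by "separate layers around it" I mean something like the following hypothetical wrapper, where a separately maintained policy check runs before the model is ever called. This is only an illustration of the idea, not a claim about how ChatGPT actually works; all names in it are invented.

```python
# Hypothetical sketch only; NOT how ChatGPT is actually built. It merely
# illustrates the second option in the question: a policy check layered
# *around* the model, maintained directly by the developers, as opposed to
# refusal behaviour learned inside the model's weights.

REFUSAL = "I can't assist with that."

def policy_check(prompt: str) -> bool:
    """Stand-in for a separately maintained safety layer, which could be a
    trained classifier, a rule set, or anything else outside the model."""
    return "build a bomb" in prompt.lower()

def call_model(prompt: str) -> str:
    """Stand-in for the language model itself (the 'black box' part)."""
    return f"<model-generated answer to: {prompt}>"

def answer(prompt: str) -> str:
    # In this layered design, the refusal logic lives in code the developers
    # control directly, not in the weights produced by training.
    if policy_check(prompt):
        return REFUSAL
    return call_model(prompt)

print(answer("How would one build a bomb?"))
print(answer("How would a marketing strategy to explode into a business segment work?"))
```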