If you have an OO language where every object always has a copy method, shouldn't that copy be deep by default?
In most languages I know, such a copy method is shallow, since a shallow copy is, of course, more lightweight: it's faster, it requires less memory, and it's quite often sufficient. And when you ask for a deep copy, the typical answer is: just write your own if you need one.
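Python's standard library exposes both operations directly, which makes the difference easy to demonstrate; a minimal sketch:

```python
import copy

# A nested structure: the outer list references an inner list.
inner = [1, 2, 3]
outer = [inner, "label"]

shallow = copy.copy(outer)    # copies only the outer list
deep = copy.deepcopy(outer)   # recursively copies the inner list too

# The shallow copy shares the inner list with the original...
inner.append(4)
print(shallow[0])  # [1, 2, 3, 4] -- changed without touching `shallow`
# ...while the deep copy is unaffected.
print(deep[0])     # [1, 2, 3]
```

Both copies are distinct top-level objects; the difference is only in whether the referenced objects are shared.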
But IMHO there is a logical flaw in that last sentence. To write a deep copy for an object, I need to know its exact type. I don't need to know the type when calling copy, as that's a universal method that all objects offer.
So when I have an object of unknown type, Object o, I cannot know whether that object references other objects, so I cannot even know whether I'd need a deep copy of it. How would I write a deep copy for it? I know that this object has a copy method, but I cannot know whether that copy is sufficient to freeze the state of the object or whether a deep copy would be required.
So why would I want just a shallow copy? I want a shallow copy when I know that an object may reference other objects, and those other objects may change, but I don't care if they do; or when I even want to change them myself and have that change reflected for whoever else references them. But when that case arises, I will usually know the type of the object, and then I can implement a shallow copy for it, or the object itself can offer one.
Let's take an object like Array for example. An array references other objects. If I just get an Object o, I cannot know whether that o is an array (well, I can test for it, but then I have to test for all existing container classes). Suppose I make a copy of it and pass that copy to some method which knows that this object is an array, accesses its elements, and performs operations according to their state. I expect that when I later pass the same copy to the same method, it will behave exactly the same. But that isn't guaranteed, as the objects the array references may have changed, despite me having never changed them in my code. So I have a copy, but that copy may change on its own. How could I avoid that? I cannot, since I cannot implement a deep copy for Object, nor start testing for all known container classes and implement deep copies for each of them.
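The scenario can be sketched in Python; `describe` here is a hypothetical stand-in for "some method that acts according to the elements' state":

```python
import copy

def describe(arr):
    # Behaves according to the state of the array's first element.
    return "all positive" if all(x > 0 for x in arr[0]) else "has non-positive"

elements = [1, 2, 3]
original = [elements]
snapshot = copy.copy(original)   # shallow: shares `elements` with `original`

print(describe(snapshot))        # "all positive"
elements[0] = -1                 # we never touched `snapshot` itself...
print(describe(snapshot))        # "has non-positive" -- the copy changed on its own
```

The caller never mutated `snapshot`, yet the same method call now gives a different answer, which is exactly the hazard described above.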
On the other hand, if copy is always deep, I don't need to know the type of o. I can just call copy on it and be certain to get a deep copy that is frozen in time and will never change, unless I change it or expose it to other code that could change it.
So when would I want a shallow copy of an array? Usually only when I know that it is an array and I just want to preserve references to the objects it contains, yet don't mind if those objects themselves change. But in that case I don't have to call copy; I can instead write a simple shallow copy, use something like a copy constructor, or, if that doesn't exist, I can usually do something like
    newArray = new Array()
    newArray.addElements(from: oldArray)

Shallow copies are usually desired in situations where the object type is known, not when it is unknown, and most objects provide an interface that makes a shallow copy easy to perform. It doesn't seem very likely to me that I have an object of unknown type and know for sure that a shallow copy will do the trick. What situation would that be? Unless I know the type, I cannot know whether a shallow copy is safe, and when in doubt, a deep copy is the safer copy.
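In a concrete language the idiom above is often a one-liner; a Python sketch, for the case where the elements are deliberately meant to stay shared:

```python
# When I know the object is a list, a shallow copy is trivial without
# any universal copy method: the constructor itself performs it.
shared = {"name": "config"}
old_list = [shared, shared]

new_list = list(old_list)    # shallow copy via the constructor
# (old_list[:] or old_list.copy() would do the same)

# References are preserved: both lists see changes to the shared object.
shared["name"] = "updated"
print(new_list[0]["name"])   # updated
print(new_list is old_list)  # False -- still a distinct list object
```

So the "I want reference sharing" case is already well served by type-specific idioms, without needing the universal copy method to be shallow.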
And as for having two copy methods for every object, copy and deepCopy: again, how would you know when to call which one? You cannot decide that unless you know the type of object you are dealing with, can you? And if you know that type, a copy constructor that is documented to be shallow would do, wouldn't it?
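For comparison, Python does take the two-methods route and lets each type implement both; a sketch using its copy protocol (the `Node` class is a made-up example type):

```python
import copy

class Node:
    """A type that customizes both kinds of copy via Python's copy protocol."""
    def __init__(self, payload):
        self.payload = payload

    def __copy__(self):
        # Shallow: the new Node shares the payload reference.
        return Node(self.payload)

    def __deepcopy__(self, memo):
        # Deep: the payload is recursively copied as well.
        return Node(copy.deepcopy(self.payload, memo))

n = Node([1, 2])
s = copy.copy(n)
d = copy.deepcopy(n)
n.payload.append(3)
print(s.payload)  # [1, 2, 3] -- shared with the original
print(d.payload)  # [1, 2]    -- independent snapshot
```

Note that this only relocates the decision: the caller still has to know, or guess, which of the two calls is safe for a value of unknown type.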
Is my logic flawed? Shouldn't the default always be optimized for safety rather than performance? Isn't optimizing copy for speed a premature optimization?