Is breaking encapsulation a necessary compromise for serialization?

Question

I've been considering how to solve this problem for a while, and I'm currently stuck between two options that both feel suboptimal. Here's an abstract example of what I'm dealing with:

I have a Project class that stores which files are part of the project, as well as dependency information:

    class Project {
        Collection<File> projectFiles;
        Multimap<File, File> dependencies;
    }

I also have an OpenFile class that represents a file open for editing:

    class OpenFile {
        Project project;
        File file;
        StringBuilder contents;
    }

What I want to do is serialize the state of a project, along with the currently opened files. This way, when the project is reopened, all open files are reopened in the same state they were in when closing. The problem is that I need OpenFile.file to always correspond to an entry in Project.projectFiles, so that modifying a file can update its dependencies correctly.

I thought about implementing a FileRef class which is an index into projectFiles:

    class FileRef {
        Project project;
        int fileId;
    }

Then Project could own the master list of files, and OpenFile could simply own a FileRef. I'm worried that breaks encapsulation, though, since FileRef is dependent on the state of Project to provide its own services - in effect separating data and code.

Is this an acceptable sacrifice, or is there a better way to implement this?

So your project is actually a directed graph, and the files are its nodes. In this setup I find the following API cleaner and easier to work with: let the FileRef be just an int, and always use Project explicitly when dealing with it. That way FileRef remains a plain data object, and it is also clear that it is meaningless without the project, which is exactly the case. Then there are no circular dependencies, and (de)serialization becomes simple. Usage is only slightly more complicated, but at least it is explicit, which I often prefer. With this setup you don't need a separate OpenFile class either.
– freakish, Commented Nov 4, 2024 at 7:27
"The problem is that I need OpenFile.file to always correspond to an entry in Project.projectFiles" -- you have failed to explain why this problem happens. Presumably you are implying that sometimes OpenFile.file does not correspond to an entry in Project.projectFiles. Why doesn't it? What kind of serialization are you doing which prevents this from being the case?
– Mike Nakis, Commented Nov 4, 2024 at 7:52
Are you using Java's built-in serialization? Please specify what kind of approach you are using.
– JimmyJames, Commented Nov 4, 2024 at 15:12
"The problem is that I need OpenFile.file to always correspond to an entry in Project.projectFiles" - the OpenFile class you have already doesn't guarantee that. How are OpenFile instances constructed? And which entity controls the list of currently open files (that belong to a project)?
– Bergi, Commented Nov 5, 2024 at 1:32

candied_orange · Accepted Answer · 2024-11-04 10:14:12Z

This isn't an encapsulation problem. This is a single source of truth problem.

Encapsulation is about making clear what private innards are not part of the public interface. Thus its users have no right to expect them to keep working the same way. Hence they should keep their grubby mitts off them and use the public interface.

Single source of truth is a strategy to keep things from getting out of sync by making sure everyone knows the one place to go for the truth. Many copies of the truth may exist. But they can go stale over time. Only the single source holds the truth.

OpenFile instances are not the source of truth for which files are in the project. What they hold is a possibly stale copy of that truth. They are the source of truth about whether a file is open, even if that file isn't in the Project any more.

Getting rid of orphaned OpenFiles is a housekeeping chore. It's not an encapsulation problem.

As for the FileRef idea, I also don't see an encapsulation problem here. Again I see a sync problem. You're indexing into a collection that lives an independent life. Someone calls remove(3) on it and all indexes greater than 2 could suddenly change what they hold.
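
To make the index problem concrete, here is a minimal sketch. The class name, method name and file names are made up for illustration; only the behaviour of List.remove(int) is standard Java.

```java
import java.util.ArrayList;
import java.util.List;

class IndexShiftDemo {
    // Returns what `index` refers to after the element at `removed` is deleted.
    // Works on a copy so the caller's list is left untouched.
    static String lookupAfterRemoval(List<String> files, int removed, int index) {
        List<String> copy = new ArrayList<>(files);
        copy.remove(removed); // List.remove(int) shifts every later element one slot left
        return copy.get(index);
    }

    public static void main(String[] args) {
        List<String> files = List.of("a.txt", "b.txt", "c.txt", "d.txt", "e.txt");
        // A reference that stored index 3 used to mean d.txt ...
        System.out.println(files.get(3));                    // d.txt
        // ... but after remove(3) the same index silently means e.txt.
        System.out.println(lookupAfterRemoval(files, 3, 3)); // e.txt
    }
}
```

Nothing fails here, which is the nasty part: the stored index stays valid, it just points at a different file.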

Solution: make sure your collection isn't an ArrayList. Use something keyed by a value that doesn't change when other entries come and go, like a hash or a UUID.
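
A rough sketch of what that could look like, assuming files are identified by a path string. FileRegistry and its methods are hypothetical names, not anything from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical registry owned by the project: the single source of truth
// for which files exist. Everything else holds only a UUID.
class FileRegistry {
    private final Map<UUID, String> pathsById = new LinkedHashMap<>();

    UUID add(String path) {
        UUID id = UUID.randomUUID();
        pathsById.put(id, path);
        return id;
    }

    void remove(UUID id) {
        pathsById.remove(id);
    }

    // Empty when the file has left the project, so an orphaned
    // reference is detected instead of resolving to the wrong file.
    Optional<String> resolve(UUID id) {
        return Optional.ofNullable(pathsById.get(id));
    }
}
```

Removing one entry leaves every other key untouched, and a reference to the removed entry resolves to an empty Optional rather than to some neighbour.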

Now this isn't to say there is no encapsulation issue to consider. There will be, as soon as you provide interfaces that actually let you do things. I don't see that here. I just see data structure stuff. Sorry, but encapsulation only means something when there are public methods that do stuff.

During debugging, a number like 1000 + n might be easier than a UUID, because you can step through your data structures manually and match up the numbers in two sources. That can be a pain with UUIDs.
– gnasher729, Commented Nov 4, 2024 at 12:27
@gnasher729 sure. Just so long as these aren’t volatile indexes that require updating when the collection changes.
– candied_orange, Commented Nov 4, 2024 at 19:40

JimmyJames · Accepted Answer · 2024-11-04 16:30:50Z

It's a little unclear from the question but I think I can surmise from it that you are either using Java native serialization or you are using some sort of other automatic serialization approach. By 'automatic', I mean that class definitions are used to define the serialization structure either wholly, or with minimal decoration such as using annotations.

I strongly recommend you avoid such approaches.

Chief Architect Mark Reinhold describes the decision in 1997 to adopt the current serialization feature as a “horrible mistake.”

Reinhold also claims that as many as half of all Java vulnerabilities are linked to the current serialization approach.

Here are a couple of articles that discuss some of the problems with Java serialization as well as similar approaches. The first one has a lot of good links (and some dead ones, unfortunately):

https://dzone.com/articles/jdk-11-beginning-of-the-end-for-java-serialization
https://adtmag.com/articles/2018/05/30/java-serialization.aspx

Most of the issues around this are related to security:

Note: Deserialization of untrusted data is inherently dangerous and should be avoided.

It's possible that you are not concerned about this because you trust the serialized structures, or you simply don't have any security concerns with this application. Even so, the issues which create these security concerns also create other practical problems.

One of the main issues is that, because the serialized data is defined by your classes, changing the structure of a class can make previously serialized data incompatible. If you want to refactor these classes, for example, you need to consider how you will deserialize things that were serialized using the old version of the class. One solution is to keep the old class around along with a routine that maps it to the new class. This is ugly, and after a few versions you have a huge mess on your hands. You can also have special applications that convert from old formats to the new ones. This issue often becomes readily apparent during development-testing cycles.

There have been tools and techniques devised to address these concerns. But once you work through all that, the 'ease' of serialization has turned into a slog. I suggest instead that you learn from the mistakes of others and avoid this trap. Take some time to design a wire/file format for your problem. What you describe here is pretty simple to model in something like JSON. You create a structure which describes the files (location, etc.) and then you create a structure which says which of those files are open, along with any other detail about them you like. Then use one of the many available approaches to marshalling and unmarshalling data explicitly. That is, use constructors to create objects the normal way based on the data in your wire/file format. This decouples your classes from the file/wire format(s).
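
As a rough illustration of that idea, here is a sketch using only the standard library. A tiny line-based text format stands in for JSON so that no third-party parser is needed; ProjectState, ProjectFormat and the record layout ("file <id> <path>", "open <id>") are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; field names are illustrative only.
final class ProjectState {
    final Map<Integer, String> pathsById; // file id -> location
    final List<Integer> openFileIds;      // ids of the files open for editing

    ProjectState(Map<Integer, String> pathsById, List<Integer> openFileIds) {
        // The ordinary constructor enforces the invariant from the question:
        // every open file must be one of the project's files.
        for (Integer id : openFileIds) {
            if (!pathsById.containsKey(id)) {
                throw new IllegalArgumentException("open file " + id + " is not in the project");
            }
        }
        this.pathsById = new LinkedHashMap<>(pathsById);
        this.openFileIds = new ArrayList<>(openFileIds);
    }
}

final class ProjectFormat {
    // Writes one record per line. Paths containing newlines are not supported.
    static String write(ProjectState state) {
        StringBuilder out = new StringBuilder();
        state.pathsById.forEach((id, path) ->
                out.append("file ").append(id).append(' ').append(path).append('\n'));
        for (Integer id : state.openFileIds) {
            out.append("open ").append(id).append('\n');
        }
        return out.toString();
    }

    // Parses the records, then builds the object through its normal constructor.
    static ProjectState read(String text) {
        Map<Integer, String> paths = new LinkedHashMap<>();
        List<Integer> open = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.trim().isEmpty()) continue;
            String[] parts = line.split(" ", 3); // limit 3 keeps spaces inside the path
            switch (parts[0]) {
                case "file": paths.put(Integer.parseInt(parts[1]), parts[2]); break;
                case "open": open.add(Integer.parseInt(parts[1])); break;
                default: throw new IllegalArgumentException("unknown record: " + line);
            }
        }
        return new ProjectState(paths, open);
    }
}
```

The format knows nothing about the classes' private layout, so the classes can be refactored freely as long as write and read keep producing and accepting the same records.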

The whole concept that you can 'serialize' an object to file/wire format is somewhat delusional. An 'object' in working memory cannot be written out and loaded into another context. The best you can do is write the details required to produce a new object with the same state or some portion of it. Once this becomes clear, the perceived advantages of automatic object serialization become much less appealing compared to a planned approach.

Steve · Accepted Answer · 2024-11-04 11:07:22Z

Is breaking encapsulation a necessary compromise for serialization?

Yes, at least in the sense that the serialised data then has a life independent of the object and the code from whence it comes. That is, serialisation in OOP is intrinsically about distilling the data from the code which operates upon it.

I suppose it is theoretically possible that OOP could serialise and store code as well as data, but in practice this doesn't happen, and would be wildly inconsistent with security (for example when serialised code was transmitted with data across a network), or with evolution of the algorithms that get applied to data (since a program would always be deserialising the version of code that applied to the original object, rather than the current version).

Typically, and also separately from whether OOP is in use or not, some kind of loading and saving process is necessary for any kind of data structure that employs pointers into the main memory of the machine for the purpose of expressing keys or references, since by definition loading and saving is about moving that data in and out of main memory, and those pointers are not reusable across different sessions in which that data is loaded into main memory. There has to be a fix-up process of some kind. This is essentially what happens when executable code is loaded from storage too.

The most typical way this fixing up can be avoided for data, incidentally, is to use pointers that are independent of main memory and can be reused across different contexts, such as the ID numbers allocated to rows by database engines.

Dereferencing such a pointer is then a case of explicitly looking it up in the relevant list, rather than simply allowing the machine to look the memory address up in the main memory address space.

I struggled to understand the remainder of your question, and what exactly the problem was with the solution of using the fileId in FileRef.

If the problem is that this ID seems somehow ad-hoc, you can of course just use a plain reference (i.e. a main memory pointer), and then assemble a memory-independent pointer on the fly at serialisation time, and fix it back up at deserialisation time.
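
A sketch of that fix-up, assuming the objects are held in a list at save time. FileNode and FixUp are hypothetical names; the position in the list is used as the memory-independent pointer, and it only ever exists in the saved form.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical node type that uses plain in-memory references.
class FileNode {
    final String name;
    final List<FileNode> dependsOn = new ArrayList<>();

    FileNode(String name) {
        this.name = name;
    }
}

class FixUp {
    // Save side: replace each reference with the position of its target in `nodes`.
    // Assumes every dependency is itself contained in `nodes`.
    static List<int[]> toIndexes(List<FileNode> nodes) {
        Map<FileNode, Integer> indexOf = new IdentityHashMap<>();
        for (int i = 0; i < nodes.size(); i++) {
            indexOf.put(nodes.get(i), i);
        }
        List<int[]> edges = new ArrayList<>();
        for (FileNode node : nodes) {
            int[] targets = new int[node.dependsOn.size()];
            for (int j = 0; j < targets.length; j++) {
                targets[j] = indexOf.get(node.dependsOn.get(j));
            }
            edges.add(targets);
        }
        return edges;
    }

    // Load side: create every node first, then turn positions back into references.
    static List<FileNode> fromIndexes(List<String> names, List<int[]> edges) {
        List<FileNode> nodes = new ArrayList<>();
        for (String name : names) {
            nodes.add(new FileNode(name));
        }
        for (int i = 0; i < edges.size(); i++) {
            for (int target : edges.get(i)) {
                nodes.get(i).dependsOn.add(nodes.get(target));
            }
        }
        return nodes;
    }
}
```

Because the positions are computed from a snapshot and discarded after loading, later insertions or removals in memory cannot invalidate them.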

But my suggestion would be to accept the use of explicit pointers like the fileId. I wouldn't use hashes or UUIDs as suggested by another answer; I'd just use the lowest positive integer not already in use among the existing entries.
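
For what it's worth, a minimal sketch of such an allocator, reading "lowest available" as the smallest positive integer not currently in use. IdAllocator is a hypothetical name.

```java
import java.util.Set;
import java.util.TreeSet;

class IdAllocator {
    private final Set<Integer> used = new TreeSet<>();

    // Returns the smallest positive integer that is not currently in use.
    int allocate() {
        int id = 1;
        while (used.contains(id)) {
            id++;
        }
        used.add(id);
        return id;
    }

    // Frees an id so that a later allocate() may hand it out again.
    void release(int id) {
        used.remove(id);
    }
}
```

One consequence of this scheme is that a released ID gets reused, so anything still holding the old ID would resolve to the new entry. It works best when references to a file are cleaned up at the moment the file is removed.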
