You could think of this as a graph, which might be kind of fun.
Let's say you have 100,000 products in your entire corpus. Each could be a node in a fully connected weighted graph, where the weight is the frequency with which two products are ordered together.
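Here's a minimal sketch of building that graph, assuming `orders` is a list of carts, each a list of product IDs (the data and the use of `networkx` here are just illustrative):

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# Hypothetical order data: each cart is a list of product IDs.
orders = [
    ["sneaker_a", "sock_b", "insole_c"],
    ["sneaker_a", "insole_c"],
    ["sock_b", "shirt_d"],
]

# Count how often each pair of products appears in the same order.
pair_counts = Counter()
for cart in orders:
    for a, b in combinations(sorted(set(cart)), 2):
        pair_counts[(a, b)] += 1

# One node per product, edge weight = co-order frequency.
G = nx.Graph()
for (a, b), count in pair_counts.items():
    G.add_edge(a, b, weight=count)
```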
This would be a pretty large graph, but you could merge nodes based on their weighted proximity to bin products into reasonable groups, perhaps dropping the number of features in your space from 100,000 to 100. For our purposes, though, let's say you collapse all 100,000 products into 10 groups.
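One way to do that merging, sticking with the graph framing, is community detection. A hedged sketch continuing from the graph `G` above (the number of communities you get is data-dependent rather than exactly 10):

```python
from networkx.algorithms.community import greedy_modularity_communities

# Partition products into communities of frequently co-ordered items.
groups = greedy_modularity_communities(G, weight="weight")

# Map each product ID to the index of the group it landed in.
product_to_group = {
    product: group_idx
    for group_idx, members in enumerate(groups)
    for product in members
}
```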
You could then encode a cart as the number of occurrences it has in each group. For instance, shoes might chiefly fall into the first 3 of the 10 groups, so a shoe-heavy cart might look something like:
[2,1,1,0,0,0,0,0,0,0]
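A small sketch of that encoding, assuming the `product_to_group` mapping from above and 10 groups in total:

```python
import numpy as np

NUM_GROUPS = 10  # assumed group count, for illustration

def encode_cart(cart, product_to_group, num_groups=NUM_GROUPS):
    """Return a vector of per-group occurrence counts for one cart."""
    vector = np.zeros(num_groups, dtype=int)
    for product in cart:
        vector[product_to_group[product]] += 1
    return vector

# A shoe-heavy cart whose items land in the first three groups might encode as
# array([2, 1, 1, 0, 0, 0, 0, 0, 0, 0]).
```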
You could then do whatever you want with this data. With a lower number of dimensions a tree might be appropriate; with a larger number of dimensions some deep learning approach might be appropriate. Sky's the limit. You could also experiment with different relationships to build the graph, like how often two things are ordered by the same person.
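For example, here's a rough sketch of feeding the encoded carts to a tree-based model; the `labels` target (say, whether the cart converted) is purely hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Encode every cart with the group-count scheme above.
X = np.stack([encode_cart(cart, product_to_group) for cart in orders])
labels = np.array([1, 0, 1])  # hypothetical per-cart target

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, labels)
```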
I feel like this approach addresses your question because you're encoding relationships in two ways:
- by binning based on proximity
- by encoding relationships spatially within the input vector