
It has been firmly established that my_tensor.detach().numpy() is the correct way to get a numpy array from a torch tensor.

I'm trying to get a better understanding of why. I have studied the internal workings of PyTorch's autodifferentiation library, and I'm still confused by these answers.

Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?

I feel that a thorough, high-quality Stack Overflow answer that explains the reason for this to new users of PyTorch who don't yet understand autodifferentiation is called for here. In particular, I think it would be helpful to illustrate the graph through a figure and show how this code could be problematic if it didn't throw an error:

    import torch

    tensor1 = torch.tensor([1.0, 2.0], requires_grad=True)
    print(tensor1)
    print(type(tensor1))
    tensor1 = tensor1.numpy()
    print(tensor1)
    print(type(tensor1))
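(For the record, on the PyTorch versions I've tried the code above does throw an error rather than silently breaking the graph. A minimal sketch of the failure and the sanctioned workaround; the exact error wording may differ across versions:)

    import torch

    tensor1 = torch.tensor([1.0, 2.0], requires_grad=True)
    # tensor1.numpy() raises something like:
    #   RuntimeError: Can't call numpy() on Tensor that requires grad.
    #   Use tensor.detach().numpy() instead.
    arr = tensor1.detach().numpy()  # detach() drops the graph first, so this works
    print(arr)  # [1. 2.]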
8 Comments

  • The Variable interface has been deprecated for a long time now (since pytorch 0.4.0). Any description of autograd which says they are necessary is outdated by a couple of years. Commented Aug 25, 2020 at 17:10
  • Is there something confusing in the official docs? I think they do a good job of explaining how computation graphs are constructed using a tensor's grad_fn attribute (of course, numpy arrays do not have a grad_fn attribute, so gradients can't be tracked for them). Commented Aug 25, 2020 at 17:16
  • It describes that operations are tracked using the grad_fn attribute, which is populated for any new tensor that is the result of a differentiable function involving tensors. Since this tracking functionality is part of the tensor class and not of numpy arrays, once you convert to a numpy array you can no longer track these operations, and therefore can't apply the chain rule of differentiation (aka backpropagation). Commented Aug 25, 2020 at 17:45
  • Also, perhaps this causes confusion, but there's no computation graph object. What is referred to as the computation graph is really an abstract composition of tensors and functions. Your resulting tensors refer to functions (via grad_fn), which themselves refer to other tensors, which refer to functions, and so on. Given a tensor, you can trace back through the grad_fn references, which will eventually reference your model parameters (leaf tensors). If you convert to numpy arrays in the middle, you can't trace back to those parameters, since only tensors have grad_fn (see the sketch after these comments). Commented Aug 25, 2020 at 17:52
  • @jodag My question was originally prompted by reading the docs for detach. Now that I've seen your comment-answer, I think that document is simply saying that the detached tensor is not tied through grad_fns to the other tensor. But the fact that it shares memory with the other tensor feels odd. But I feel like that's a different question than I am trying to ask here. Commented Aug 25, 2020 at 18:09
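To make that grad_fn-chain idea concrete, here is a hedged sketch of walking the chain by hand (the exact repr strings and backward-node names vary by PyTorch version):

    import torch

    x = torch.rand(4, requires_grad=True)
    w = torch.rand(4, requires_grad=True)
    z = (x @ w) ** 2

    print(z.grad_fn)                 # e.g. <PowBackward0 object at 0x...>
    print(z.grad_fn.next_functions)  # e.g. ((<DotBackward0 object at 0x...>, 0),)
    # following next_functions repeatedly eventually reaches AccumulateGrad
    # nodes that refer back to the leaf tensors x and w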

3 Answers


I think the most crucial point to understand here is the difference between a torch.tensor and an np.ndarray:
While both objects are used to store n-dimensional matrices (aka "Tensors"), torch.tensors have an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.

So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.tensor can be used interchangeably.

However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only the numeric values, but (more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t each of the independent variables used to compute the loss.

As mentioned before, the np.ndarray object does not have this extra "computational graph" layer, and therefore, when converting a torch.tensor to an np.ndarray you must explicitly remove the computational graph of the tensor using the detach() command.
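A minimal sketch of what detach() does before the conversion (the variable names are just for illustration):

    import torch

    t = torch.tensor([1.0, 2.0], requires_grad=True)
    u = t * 2               # u carries a grad_fn linking it back to t
    print(u.grad_fn)        # <MulBackward0 object at 0x...>

    d = u.detach()          # same values, same storage, but no graph attached
    print(d.grad_fn)        # None
    print(d.requires_grad)  # False
    arr = d.numpy()         # now the conversion is allowed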


Computational Graph
From your comments it seems like this concept is a bit vague. I'll try and illustrate it with a simple example.
Consider a simple function of two (vector) variables, x and w:

    x = torch.rand(4, requires_grad=True)
    w = torch.rand(4, requires_grad=True)
    y = x @ w   # inner-product of x and w
    z = y ** 2  # square the inner product

If we are only interested in the value of z, we need not worry about any graphs; we simply move forward from the inputs, x and w, to compute y and then z.

However, what would happen if we do not care so much about the value of z, but rather want to ask the question "what is the w that minimizes z for a given x"?
To answer that question, we need to compute the derivative of z w.r.t w.
How can we do that?
Using the chain rule we know that dz/dw = dz/dy * dy/dw. That is, to compute the gradient of z w.r.t w we need to move backward from z to w, computing the gradient of each operation as we retrace our steps. This "path" we trace back is the computational graph of z, and it tells us how to compute the derivative of z w.r.t the inputs leading to z:

    z.backward()  # ask pytorch to trace back the computation of z

We can now inspect the gradient of z w.r.t w:

    w.grad  # the resulting gradient of z w.r.t w
    tensor([0.8010, 1.9746, 1.5904, 1.0408])

Note that this is exactly equal to

    2 * y * x
    tensor([0.8010, 1.9746, 1.5904, 1.0408], grad_fn=<MulBackward0>)

since dz/dy = 2*y and dy/dw = x.
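As a quick sanity check, you can verify the chain rule directly in code (a minimal sketch reusing the definitions above; allclose simply compares the two tensors numerically):

    import torch

    x = torch.rand(4, requires_grad=True)
    w = torch.rand(4, requires_grad=True)
    y = x @ w
    z = y ** 2
    z.backward()

    # chain rule by hand: dz/dw = dz/dy * dy/dw = 2*y * x
    print(torch.allclose(w.grad, 2 * y * x))  # True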

Each tensor along the path stores its "contribution" to the computation:

    z
    tensor(1.4061, grad_fn=<PowBackward0>)

And

    y
    tensor(1.1858, grad_fn=<DotBackward>)

As you can see, y and z store not only the "forward" values of <x, w> and y**2, but also the computational graph -- the grad_fn that is needed to compute the derivatives (using the chain rule) when tracing back the gradients from z (the output) to w (the inputs).

These grad_fn are essential components of torch.tensors, and without them one cannot compute derivatives of complicated functions. However, np.ndarrays have no such capability at all and do not store this information.

Please see this answer for more information on tracing back the derivative using the backward() function.
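To see the graph actually break, here is a small illustrative sketch (the variable names are mine) of what happens if you detour through numpy in the middle of a computation:

    import torch

    x = torch.rand(4, requires_grad=True)
    w = torch.rand(4, requires_grad=True)
    y = x @ w

    y_np = y.detach().numpy()    # the graph ends here
    z = torch.tensor(y_np) ** 2  # a fresh tensor with no link back to w

    print(z.grad_fn)  # None -- autograd never saw the detour through numpy
    # z.backward() would raise:
    #   RuntimeError: element 0 of tensors does not require grad
    #   and does not have a grad_fn
    # and w.grad would stay None: no gradient can flow back to w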


Since both np.ndarray and torch.tensor share a common "layer" storing an n-d array of numbers, pytorch uses the same storage to save memory:

numpy() → numpy.ndarray
Returns self tensor as a NumPy ndarray. This tensor and the returned ndarray share the same underlying storage. Changes to self tensor will be reflected in the ndarray and vice versa.

The other direction works in the same way as well:

torch.from_numpy(ndarray) → Tensor
Creates a Tensor from a numpy.ndarray.
The returned tensor and ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa.

Thus, when creating an np.array from a torch.tensor (or vice versa), both objects reference the same underlying storage in memory. Since np.ndarray does not store/represent the computational graph associated with the array, this graph should be explicitly removed using detach() when both numpy and torch wish to reference the same storage.
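A hedged illustration of how badly this can bite (behavior as I've observed it on recent PyTorch versions: the backward of ** saves its input, and a numpy write bypasses autograd's version tracking, so nothing warns you):

    import torch

    x = torch.tensor([2.0], requires_grad=True)
    y = x ** 2                # backward will use the saved x: dy/dx = 2*x

    arr = x.detach().numpy()  # shares storage with x
    arr[0] = 100.0            # rewrites x behind autograd's back

    y.backward()
    print(x.grad)             # tensor([200.]) -- silently wrong; the original
                              # x = 2.0 should have given 4.0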


Note that if you wish, for some reason, to use pytorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably.

    import numpy as np
    import torch

    with torch.no_grad():
        x_t = torch.rand(3, 4)
        y_np = np.ones((4, 2), dtype=np.float32)
        x_t @ torch.from_numpy(y_np)  # dot product in torch
        np.dot(x_t.numpy(), y_np)     # the same dot product in numpy

10 Comments

I think you generally do a good job keeping the discussion both simple and accurate, but I find the discussion of shared memory confusing. I feel there is something that should be obvious about why, "Since np.ndarray does not store/represent the computational graph associated with the array, this graph should be explicitly removed using detach() when sharing both numpy and torch wish to reference the same tensor," and yet it's not quite obvious enough. Could you elaborate on that a bit?
And, do you think that a figure that illustrates the computational graph, e.g., for the sample code at the end of my question, would clarify your answer further?
I really like how you mention with torch.no_grad() as an alternative to detach.
To be honest the fact that you cannot express the computational graph in numpy doesn't really explain why you must detach the tensor before calling numpy on it. It seems to me the designers of torch could have implicitly detached when numpy is called. I'm not sure if there are any side effects? It seems detach returns a copy, so the original tensor and computational graph aren't affected?
@DavidWaterworth because they share the same storage, if you do not explicitly detach, really bad things can happen and it would be extremely difficult to debug
Calling .detach() does not un-link the shared memory though, so this is misleading! You need .clone() to do that. As far as I understand it, .detach() might be best practice but does not really affect anything anybody who asks this question might attempt to do. It's a semantic helper but IMO can hurt more than it helps, because someone might mistake it for cloning.

I asked, Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?

Yes, the new tensor will not be connected to the old tensor through a grad_fn, and so any operations on the new tensor will not carry gradients back to the old tensor.

Writing my_tensor.detach().numpy() is simply saying, "I'm going to do some non-tracked computations based on the value of this tensor in a numpy array."
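To make that concrete, a small sketch (the names are illustrative): work done on the detached numpy array is invisible to autograd and has no effect on the gradients of the original tensor.

    import torch

    t = torch.tensor([1.0, 2.0], requires_grad=True)
    s = (t * 3).sum()       # tracked: s has a grad_fn chain back to t

    a = t.detach().numpy()  # untracked view of t's values
    b = a * 10              # plain numpy math; autograd never sees it

    s.backward()
    print(t.grad)           # tensor([3., 3.]) -- unaffected by the numpy work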

The Dive into Deep Learning (d2l) textbook has a nice section describing the detach() method, although it doesn't talk about why a detach makes sense before converting to a numpy array.

Thanks to jodag for helping to answer this question.

2 Comments

It seems like you got the answer pretty clearly. Why the bounty? Is there anything you think can be made clearer?
@hkchengrex et al. I'm looking specifically for an answer that explains, through figures and simple language appropriate for a newbie, why one must call detach(). I think if the figures illustrated the graph, grad_fn, etc., for the example I just borrowed from Blupon and pasted in my question above, it would explain more clearly not just the question, but pytorch's autodiff functionality. I feel I understand the topic reasonably well myself, but I think such an explanation will provide more theoretical depth to SO's coverage of .detach() beyond a quick code solution.

This is a little showcase of a tensor -> numpy array connection:

    import torch

    tensor = torch.rand(2)
    numpy_array = tensor.numpy()

    print('Before edit:')
    print('Tensor:', tensor)
    print('Numpy array:', numpy_array)

    tensor[0] = 10

    print()
    print('After edit:')
    print('Tensor:', tensor)
    print('Numpy array:', numpy_array)

Output:

    Before edit:
    Tensor: tensor([0.1286, 0.4899])
    Numpy array: [0.1285522  0.48987144]

    After edit:
    Tensor: tensor([10.0000,  0.4899])
    Numpy array: [10.          0.48987144]

The value of the first element is shared by the tensor and the numpy array. Changing it to 10 in the tensor changed it in the numpy array as well.
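If you want a numpy copy that is not linked to the tensor's storage (as the comments note, detach() alone does not do this), one option is to clone or copy first; a minimal sketch:

    import torch

    tensor = torch.rand(2)
    independent = tensor.detach().clone().numpy()  # or: tensor.detach().numpy().copy()

    tensor[0] = 10
    print(independent[0])  # unchanged -- the copy has its own storage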

2 Comments

This does not answer the question. tensor.detach().numpy()[0] = 10 has the exact same effect.
This does answer part of the question. (And I am the OP!) It demonstrates that detach does NOT "detach" the numpy array from the underlying data of the tensor.
