
I have a matrix A of dimension 1000x70000. My loss function includes A, and I want to find the optimal value of A using gradient descent, with the constraint that the rows of A remain in the probability simplex (i.e. every row sums to 1). I have initialised A as given below:

import numpy as np
import torch

A = np.random.dirichlet(np.ones(70000), 1000)
A = torch.tensor(A, requires_grad=True)

and my training loop looks like this:

for epoch in range(500):
    y_pred = forward(X)
    y = model(torch.mm(A.float(), X))
    l = loss(y, y_pred)
    l.backward()
    A.grad.data = -A.grad.data
    optimizer.step()
    optimizer.zero_grad()
    if epoch % 2 == 0:
        print("Loss", l, "\n")

1 Answer


An easy way to accomplish that is not to use A directly in the computation, but a row-normalized version of A.

# you can keep 'A' unconstrained
A = torch.rand(1000, 70000, requires_grad=True)

then divide each row by its sum (so every row always sums to 1):

for epoch in range(500):
    y_pred = forward(X)
    B = A / A.sum(-1, keepdim=True)  # normalize rows manually
    y = model(torch.mm(B, X))
    l = loss(y, y_pred)
    ...

So now, at each step, B is the constrained matrix, i.e. the quantity of interest. However, the optimization is still performed on the (unconstrained) A.
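Put together, a minimal sketch of the full training loop with this trick might look as follows. It assumes forward, model, loss, and X are defined as in the question; the optimizer (Adam here, just as an example) is built over the unconstrained parameter A. Note that plain row normalization does not enforce non-negativity; that is handled in the edit below.

import torch

# unconstrained parameter that gradient descent actually updates
A = torch.rand(1000, 70000, requires_grad=True)
optimizer = torch.optim.Adam([A], lr=1e-2)   # example optimizer choice

for epoch in range(500):
    y_pred = forward(X)
    B = A / A.sum(-1, keepdim=True)          # rows of B sum to 1
    y = model(torch.mm(B, X))
    l = loss(y, y_pred)

    optimizer.zero_grad()
    l.backward()                             # gradients flow through B back to A
    optimizer.step()

    if epoch % 2 == 0:
        print("Loss", l.item())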


Edit: @Umang Gupta reminded me in the comment section that the OP wants the rows to lie in the probability simplex, which imposes another constraint, i.e. A >= 0.

To accomplish that, you can simply apply an appropriate element-wise function with positive outputs (e.g. torch.exp, torch.sigmoid) to A in each iteration:

A_ = torch.exp(A)
B = A_ / A_.sum(-1, keepdim=True)  # normalize rows

The exact choice of function depends on the training dynamics and needs to be experimented with.
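As a side note, torch.exp followed by row normalization is mathematically the same as a softmax over each row, so the two lines above can also be written with the built-in, which is also more numerically stable:

B = torch.softmax(A, dim=-1)  # rows of B are non-negative and sum to 1
# equivalent to: A_ = torch.exp(A); B = A_ / A_.sum(-1, keepdim=True)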


2 Comments

You may also want each element of A to be >0.
@UmangGupta right. I have edited the answer.
