8

I have two huge matrices with equal dimensions. I want to calculate Euclidean distance between them. I know this is the function:

euclidean_distance <- function(p,q){ sqrt(sum((p - q)^2)) } and if these are two matrices: set.seed(123) mat1 <- data.frame(x=sample(1:10000,3), y=sample(1:10000,3), z=sample(1:10000,3)) mat2 <- data.frame(x=sample(1:100,3), y=sample(1:100,3), z=sample(1:1000,3)) 

then I need the answer be a new matrix 3*3 showing Euclidean distance between each pair of values of mat1 and mat2.

any suggestion please?

1
  • @AndresT I want the output to be a matrix too Commented Jan 30, 2016 at 20:28

4 Answers 4

12

This is a job for the base function outer:

outer(mat1,mat2,Vectorize(euclidean_distance)) 
 x y z x 9220.40 9260.736 8866.034 y 12806.35 12820.086 12121.927 z 11630.86 11665.869 11155.823 
Sign up to request clarification or add additional context in comments.

3 Comments

This is giving error Error in formals(FUN) : object 'euclidean_distance' not found
See the OP's function named euclidean_distance(). It is not a built-in R function.
The method in this answer calculates the distance between columns and not rows. And if it's used with a matrix or a transposed dataframe, then it produces a 4-dimensional array. To calculate the distance between rows, you can convert the rows of each input matrix to a list of vectors: x=matrix(1:12,4);y=matrix(1:9,3);outer(split(x,row(x)),split(y,row(y)),Vectorize(function(x,y)sqrt(sum((x-y)^2)))). But in my benchmark, that was about a hundred times slower than sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*x%*%t(y)).
10

You can use the package pdist:

library(pdist) dists <- pdist(t(mat1), t(mat2)) as.matrix(dists) [,1] [,2] [,3] [1,] 9220.40 9260.735 8866.033 [2,] 12806.35 12820.086 12121.927 [3,] 11630.86 11665.869 11155.823 

this will give you all Euclidean distances of the pairs: (mat1$x,mat2$x), (mat1$x,mat2$y),..., (mat1$z,mat2$z)

8 Comments

is this way calculates Euclidean distance among pairs?
yes, this amounts to applying your function, euclidean_distance() to all the pairs.
I tried to install that library by install.packages(pdist) but it gives error:Error in install.packages : object 'pdist' not found. how can I install this library?
maybe install.packages("pdist")?
@Grec001 yes, it is the pairwise distance (L2) between each pair of "obervations". Or as stated in the documentation: "Compute distance matrix between two matrices of observations, or two subsets of one matrix"
|
3
library(Rcpp) library(microbenchmark) cppFunction('NumericMatrix crossdist(NumericMatrix x,NumericMatrix y){ int n1=x.nrow(),n2=y.nrow(),ncol=x.ncol(),i,j,k; if(ncol!=y.ncol())throw std::runtime_error("Different column number"); NumericMatrix out(n1,n2); for(i=0;i<n1;i++) for(j=0;j<n2;j++){ double sum=0; for(k=0;k<ncol;k++)sum+=pow(x(i,k)-y(j,k),2); out(i,j)=sqrt(sum); } return out; }') cppFunction('NumericMatrix crossdist2(NumericMatrix x,NumericMatrix y){ int n1=x.nrow(),n2=y.nrow(),ncol=x.ncol(),i,j,k; if(ncol!=y.ncol())throw std::runtime_error("Different column number"); NumericMatrix out(n1,n2); double rs1[n1],rs2[n2],sum; for(i=0;i<n1;i++){sum=0;for(j=0;j<ncol;j++)sum+=pow(x(i,j),2);rs1[i]=sum;} for(i=0;i<n2;i++){sum=0;for(j=0;j<ncol;j++)sum+=pow(y(i,j),2);rs2[i]=sum;} for(i=0;i<n1;i++)for(j=0;j<n2;j++){ sum=0; for(k=0;k<ncol;k++)sum+=x(i,k)*y(j,k); out(i,j)=sqrt(rs1[i]+rs2[j]-2*sum); } return out; }') x=matrix(rnorm(2e4),,10) y=matrix(rnorm(1e4),,10) b=microbenchmark(times=100, crossdist(x,y), crossdist2(x,y), Rfast::dista(x,y), proxy::dist(x,y), pracma::distmat(x,y), as.matrix(pdist::pdist(x,y)), sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*tcrossprod(x,y)), sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*x%*%t(y)), sqrt(Rfast::Outer(Rfast::rowsums(y^2),Rfast::rowsums(x^2),"+")-2*x%*%t(y)), sqrt(Rfast::Outer(Rfast::rowsums(y^2),Rfast::rowsums(x^2),"+")-2*Rfast::Tcrossprod(x,y)), outer(split(x,row(x)),split(y,row(y)),Vectorize(function(x,y)sqrt(sum((x-y)^2)))) ) a=aggregate(b$time,list(b$expr),median) a=a[order(a[,2]),] writeLines(paste(sprintf("%.3f",a[,2]/min(a[,2])),gsub(" ","",a[,1]))) 

Result:

1.000 crossdist(x,y) 1.054 crossdist2(x,y) 1.217 sqrt(Rfast::Outer(Rfast::rowsums(y^2),Rfast::rowsums(x^2),"+")-2*Rfast::Tcrossprod(x,y)) 1.227 sqrt(Rfast::Outer(Rfast::rowsums(y^2),Rfast::rowsums(x^2),"+")-2*x%*%t(y)) 1.897 Rfast::dista(x,y) 1.946 sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*tcrossprod(x,y)) 1.950 sqrt(outer(rowSums(x^2),rowSums(y^2),"+")-2*x%*%t(y)) 2.004 proxy::dist(x,y) 2.402 as.matrix(pdist::pdist(x,y)) 3.674 pracma::distmat(x,y) 177.474 outer(split(x,row(x)),split(y,row(y)),Vectorize(function(x,y)sqrt(sum((x-y)^2)))) 

tcrossprod(m1,m2) is supposed to be a slightly faster alternative to m1%*%t(m2), even though it was about as fast in this benchmark:

> m1=matrix(rnorm(2e4),,10);m2=matrix(rnorm(1e4),,10) > microbenchmark(times=1000,tcrossprod(m1,m2),m1%*%t(m2),Rfast::Tcrossprod(m1,m2)) expr min lq mean median uq tcrossprod(m1, m2) 12.28305 13.06046 17.58402 17.60379 17.74104 m1 %*% t(m2) 12.79996 17.30764 17.52570 17.59473 17.70758 Rfast::Tcrossprod(m1, m2) 11.48939 13.81658 17.68059 17.23675 17.37447 

This is a fast way to calculate the distance of row 1 in m1 to row 1 in m2, row 2 in m1 to row 2 in m2, and so on:

sqrt(rowSums((m1-m2)^2)) 

This is a fast way to calculate the distance of a vector v to each row of matrix m:

sqrt(rowSums(m^2)+sum(v^2)-2*(m%*%as.matrix(v))[,1]) 

3 Comments

Rfast::Tcrossprod needs more than 500*500 size to play fast.
@ManosPapadakis Is it a feature or a bug that the matrix returned by Rfast::Outer is a transposed version of the regular outer function?
No it is not a bug. My partner want it to be called Rfast::Outer(y,x). So if you want the same result, just flip the arguments and it will play correct.
1

Let X and Y be matrices with the same number k of columns but possibly different numbers of rows. You can calculate the Euclidean distance matrix D, i.e. D[i,j] = sqrt( (X[i,1] - Y[j,1])^2 + ... + (X[i,k] - Y[j,k])^2) with this one-liner, using only base R functions:

D = sqrt(do.call(`+`, lapply(1:k, function(j) outer(X[,j], Y[,j], '-')^2))) 

Explanation: First use lapply and outer to get a list of matrices of squared differences, calculated separately for each column as follows:

D2_list = lapply(1:k, function(j) outer(X[,j], Y[,j], '-')^2) 

Then add up the matrices in the list with do.call and take the element-wise square root of the result:

D = sqrt(do.call(`+`, D2_list)) 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.