Distance between two different sets of points

Question

Before writing this post I looked around for a library that could solve this problem. I didn't find much so I decided to try to write this.

My problem is the following:

I have two sets of points with coordinates x1, y1 and x2, y2. The sets have different numbers of elements. I want to find the average distance between all the elements in set 1 and all the elements in set 2, given a certain cutoff. This means that if two points (one from each set) are farther apart than the cutoff, they should not be considered.

The easiest solution is to perform an \$O(n^2)\$ search and then filter the results by distance, but that is inefficient.

I tried to write an algorithm that divides the space of the sets into squares of size "cutoff". To each point I can associate two indexes that tell me which box the point belongs to. Looking at the indexes, I can generate the lists of neighboring points and calculate the distances only between points that are in adjacent boxes.

    #include <vector>
    #include <algorithm>
    #include <iostream>
    #include <time.h>
    #include <numeric>

    using namespace std;

    // euclidean distance
    double euc(double x, double y) {
        return sqrt(x * x + y * y);
    }

    //calculate the distance vector between two different sets of points
    // set 1 of coordinate x1, y1
    // set 2 of coordinate x2, y2
    vector <double> all_dist(vector <double>& x1, vector <double>& x2, vector <double>& y1, vector <double>& y2) {
        vector <double> d(x1.size()*x2.size());
        for (int i = 0; i < x1.size(); ++i) {
            for (int j = 0; j < x2.size(); ++j) {
                d[i*x2.size()+j]=euc(x1[i] - x2[j], y1[i] - y2[j]);
            }
        }
        return d;
    }

    vector <double> dist(vector <double>& x1, vector <double>& x2, vector <double>& y1, vector <double>& y2, double cutoff) {
        //we divide the space of the vectors in squares of size cutoff
        // each square has two indexes, one for the x coordinates and one for the y
        // these vectors contain the indexes for each point for the vector x1 and y1
        // means: by reading the content of box_x1[n] box_y1[n] we know to which spatial
        // box the point "n" belongs
        vector <int> box_x1(x1.size());
        vector <int> box_y1(y1.size());
        // these vectors contain the indexes for each point for the vector x2 and y2
        vector <int> box_x2(x2.size());
        vector <int> box_y2(y2.size());
        vector <double> res;

        // we compute the maximum number of sections (or divisions) in the x and y dimension
        // we need to find the maximum and minimum value for each vector
        double maxx = max(*max_element(x1.begin(), x1.end()), *max_element(x2.begin(), x2.end()));
        double minx = min(*min_element(x1.begin(), x1.end()), *min_element(x2.begin(), x2.end()));
        double maxy = max(*max_element(y1.begin(), y1.end()), *max_element(y2.begin(), y2.end()));
        double miny = min(*min_element(y1.begin(), y1.end()), *min_element(y2.begin(), y2.end()));
        int max_box_x = int((maxx - minx) / (1.1 * cutoff));
        int max_box_y = int((maxy - miny) / (1.1 * cutoff));

        for (int i = 0; i < x1.size(); ++i) {
            box_x1[i]=(int((x1[i] - minx) / (1.1*cutoff) ));
            box_y1[i]=(int((y1[i] - miny) / (1.1*cutoff) ));
        }
        for (int i = 0; i < x2.size(); ++i) {
            box_x2[i]=(int((x2[i] - minx) / (1.1*cutoff)));
            box_y2[i]=(int((y2[i] - miny) / (1.1*cutoff)));
        }

        // we need to create the list of neighbors points for a specific box
        // we need to consider all the boxes that are neighboring a specific box
        // this mean looking at the boxes +1 and -1
        for (int i = 0; i < max_box_x; ++i) {
            for (int j = 0; j < max_box_y; ++j) {
                vector <double> points_x1c, points_y1c, points_x2c, points_y2c;
                for (int k = 0; k < box_x1.size(); ++k) {
                    if ((box_x1[k] == i || box_x1[k] == i + 1 || box_x1[k] == i - 1) &&
                        (box_y1[k] == j || box_y1[k] == j + 1 || box_y1[k] == j - 1)) {
                        points_x1c.push_back(x1[k]);
                        points_y1c.push_back(y1[k]);
                    }
                }
                for (int k = 0; k < box_x2.size(); ++k) {
                    if ((box_x2[k] == i || box_x2[k] == i + 1 || box_x2[k] == i - 1) &&
                        (box_y2[k] == j || box_y2[k] == j + 1 || box_y2[k] == j - 1)) {
                        points_x2c.push_back(x2[k]);
                        points_y2c.push_back(y2[k]);
                    }
                }
                // now that we have the two list of points (we have four vectors)
                // we can calculate the distances between these points
                vector <double> temp = all_dist(points_x1c, points_x2c, points_y1c, points_y2c);
                // we still accept only the distances below the cutoff
                vector <double> temp2;
                for (int m = 0; m < temp.size(); ++m)
                    if (temp[m] < cutoff)
                        temp2.push_back(temp[m]);
                move(temp2.begin(), temp2.end(), back_inserter(res));
            }
        }
        return res;
    }

    int main() {
        int num_el = 50000;
        double cutoff = 200;
        vector<double> x1(num_el);
        vector<double> y1(num_el);
        vector<double> x2(num_el/2);
        vector<double> y2(num_el/2);
        generate(x1.begin(), x1.end(), rand);
        generate(y1.begin(), y1.end(), rand);
        generate(x2.begin(), x2.end(), rand);
        generate(y2.begin(), y2.end(), rand);

        clock_t begin_time = clock();
        vector <double> res = dist(x1, x2, y1, y2,cutoff);
        cout << float(clock() - begin_time) / CLOCKS_PER_SEC<<endl;
        cout << accumulate(res.begin(), res.end(), 0.0) /res.size() << endl;

        begin_time = clock();
        res=all_dist(x1, x2, y1, y2);
        cout << float(clock() - begin_time) / CLOCKS_PER_SEC << endl;
        vector <double> res2;
        for (int i = 0; i < res.size(); ++i)
            if (res[i] < cutoff)
                res2.push_back(res[i]);
        cout << accumulate(res2.begin(), res2.end(), 0.0) / res2.size() << endl;
    }

I wonder whether there are optimizations that can be applied to the code, and also whether there is a library somewhere that does what I need. I measured the speed of the simple \$O(n^2)\$ solution against the "box solution" (\$O(n \log n)\$? I wish). The box solution is faster, but its result is not exactly the same as that of the \$O(n^2)\$ solution. Is this an acceptable error in such an algorithm?

I have put an image to explain my reasoning. The boxes should be squares of side cutoff (approximately). In the code, the variables max_box_x and max_box_y are, for the image, 4 and 3 respectively. The vectors box_x1 and box_y1 have as many elements as the corresponding set. By looking at box_x1[n] and box_y1[n] we can tell which box the particle with index n belongs to.

Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. Consider pointing out the bug in a self-answer instead. Your explanation of the bug was basically a short review already. — Mast
– Mast ♦, Commented Oct 23, 2019 at 13:51

Edward · Accepted Answer · 2019-10-24 12:36:10Z

I see a number of things that may help you improve your program. I'll start with the more superficial and progress to more substantive suggestions.

Don't abuse `using namespace std`

Putting `using namespace std` at the top of every program is a bad habit that you'd do well to avoid. Know when to use it and when not to (as when writing include headers). In this particular case, I happen to think it's not terrible, but it also doesn't help much.

Make sure you have all required `#include`s

The code uses `sqrt` but doesn't `#include <cmath>`. Also, remember that `sqrt` is in the `std` namespace, so the fully qualified name is `std::sqrt`. Additionally, `rand` is used but `<cstdlib>` is not included.

Be careful with signed and unsigned

In the current code, the loop integers `i`, `j` and `k` are signed `int` values, but they're being compared with the unsigned quantities `x1.size()` and `x2.size()`. Better would be to declare them all as `unsigned` or `size_t`.

Use `const` where practical

The passed vectors should not be altered (and are not altered) by the processing functions. Indicate that fact (and maybe even get a small speed boost) by specifying `const std::vector<double>&` as the type of the function parameters.
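As a rough illustration of the last two points, here is how the question's `all_dist` might look with `const` reference parameters and `std::size_t` loop indices. The behaviour is meant to be unchanged; the inlined square root replacing `euc` is only a convenience of this sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Same contract as the question's all_dist: d[i * x2.size() + j] holds the
// distance between point i of set 1 and point j of set 2.
std::vector<double> all_dist(const std::vector<double>& x1, const std::vector<double>& x2,
                             const std::vector<double>& y1, const std::vector<double>& y2)
{
    std::vector<double> d(x1.size() * x2.size());
    for (std::size_t i = 0; i < x1.size(); ++i) {
        for (std::size_t j = 0; j < x2.size(); ++j) {
            const double dx = x1[i] - x2[j];
            const double dy = y1[i] - y2[j];
            d[i * x2.size() + j] = std::sqrt(dx * dx + dy * dy);
        }
    }
    return d;
}
```

With `const` in the signature, an accidental write to one of the inputs becomes a compile error instead of a silent bug.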

Use "range `for`" and simplify your code

If you're using a C++11 compliant compiler, the use of "range for" can simplify your code. For example, the code currently contains this:

    for (unsigned i = 0; i < res.size(); ++i)
        if (res[i] < cutoff)
            res2.push_back(res[i]);

It could be this:

    for (auto dist: res)
        if (dist < cutoff)
            res2.push_back(dist);

Better would be to eliminate it completely, but we'll get to that later.

Don't use `std::endl` if you don't really need it

The difference between std::endl and '\n' is that '\n' just emits a newline character, while std::endl actually flushes the stream. This can be time-consuming in a program with a lot of I/O and is rarely actually needed. It's best to only use std::endl when you have some good reason to flush the stream and it's not very often needed for simple programs such as this one. Avoiding the habit of using std::endl when '\n' will do will pay dividends in the future as you write more complex programs with more I/O and where performance needs to be maximized.

Don't store data you don't need

Right now, the code attempts to calculate the comprehensive list of distances and puts them into a large vector. On my machine, a double is 8 bytes. With 50000 and 25000 points in the two sets, this means the resulting distance vector is 50000 * 25000 * 8 = 10,000,000,000 bytes or 9.3 GiB. That's a huge amount of data, which the program then mostly discards and then reduces to a single number. (On my machine, that number is always -nan which suggests another problem, but more on that later.)

Use classes more effectively

The description talks about points, but the program actually uses vectors of coordinates instead. I'd recommend creating a templated Point2D object like this:

    template <typename T>
    class Point2D {
    public:
        Point2D(T x, T y) : x_(x), y_(y) {}
        T dist(const Point2D& other) const {
            return std::sqrt(sqdist(other));
        }
        T sqdist(const Point2D& other) const {
            const auto dx{x_ - other.x_};
            const auto dy{y_ - other.y_};
            return dx*dx + dy*dy;
        }
    private:
        T x_, y_;
    };

Consider using a better random number generator

Right now, the program is using the old C-style `rand`, which is not a very good random number generator. The intent is apparently to create numbers in the range [0, `RAND_MAX`], where `RAND_MAX` is implementation-defined. On my machine, `RAND_MAX` = 2147483647, but on yours I suspect it must be 32767. Otherwise you would also be getting -nan for the result, or you'd have chosen a different cutoff value. So rather than relying on implementation-defined values and a poor random number generator, if you are using a compiler that supports at least C++11, consider using a better random number generator. In particular, instead of `rand`, you might want to look at `std::uniform_real_distribution` and friends in the `<random>` header. Here's one way to do it:

    double newrand() {
        static std::random_device rd;
        static std::mt19937 gen(rd());
        static std::uniform_real_distribution<double> dis(0, 32768);
        return dis(gen);
    }

Avoid computationally costly operations

The distance between every pair of points is calculated which means millions of calls to std::sqrt, but this is not really needed since it would be sufficient to calculate the squared distance instead and then compare with the squared cutoff value. We can go a bit further and test the square of each coordinate pair. That is, if \$(\Delta x)^2 \ge t^2\$, or \$(\Delta y)^2 \ge t^2\$, then there's no point in doing further calculations.
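One way to sketch this early rejection is shown below. The function name `sqdist_within` and the negative sentinel return value are conventions invented for this example. Because the coordinate differences can be negative, the per-axis test compares their absolute values against the cutoff, which is equivalent to comparing the squares against \$t^2\$.

```cpp
#include <cmath>

// Returns the squared distance when the two points are closer than `cutoff`,
// or -1.0 (a sentinel chosen for this sketch) when they are not.
double sqdist_within(double x1, double y1, double x2, double y2, double cutoff)
{
    const double dx = std::abs(x1 - x2);
    if (dx >= cutoff) return -1.0;   // too far apart in x alone
    const double dy = std::abs(y1 - y2);
    if (dy >= cutoff) return -1.0;   // too far apart in y alone
    const double d2 = dx * dx + dy * dy;
    return d2 < cutoff * cutoff ? d2 : -1.0;
}
```

A caller would take `std::sqrt` only of the non-negative results, so the square root is paid only for pairs that actually contribute to the average.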

Results

When I first ran the program, it took 30.9 seconds just for the `all_dist` call and a total of 55.1 seconds to finally print the answer. After applying all of the suggestions above, the program runs in 0.95 seconds. Here's the revised code:

    #include <vector>
    #include <algorithm>
    #include <iostream>
    #include <ctime>
    #include <cmath>
    #include <random>

    double newrand() {
        static std::random_device rd;
        static std::mt19937 gen(rd());
        static std::uniform_real_distribution<double> dis(0, 32768);
        return dis(gen);
    }

    template <typename T>
    class Point2D {
    public:
        Point2D() : x_(newrand()), y_(newrand()) {}
        Point2D(T x, T y) : x_(x), y_(y) {}
        T dist(const Point2D& other) const {
            return std::sqrt(sqdist(other));
        }
        T sqdist(const Point2D& other) const {
            const auto dx{x_ - other.x_};
            const auto dy{y_ - other.y_};
            return dx*dx + dy*dy;
        }
        T sqdist_thr(const Point2D& other, T threshold) const {
            const auto dx{x_ - other.x_};
            const auto dy{y_ - other.y_};
            if (dx < threshold && dy < threshold)
                return dx*dx + dy*dy;
            return threshold;
        }
    private:
        T x_, y_;
    };

    double avgdist(const std::vector<Point2D<double>>& a, const std::vector<Point2D<double>>& b, double threshold) {
        const double threshold2 = threshold*threshold;
        double sum{0};
        unsigned count{0};
        for (const auto &one: a) {
            for (const auto &two: b) {
                auto d2 = one.sqdist_thr(two, threshold2);
                if (d2 < threshold2) {
                    ++count;
                    sum += std::sqrt(d2);
                }
            }
        }
        return sum/count;
    }

    int main() {
        constexpr int num_el = 50000;
        constexpr double cutoff = 200;
        const std::vector<Point2D<double>> a(num_el);
        const std::vector<Point2D<double>> b(num_el/2);

        clock_t begin_time = clock();
        std::cout << avgdist(a, b, cutoff) << '\n';
        std::cout << float(clock() - begin_time) / CLOCKS_PER_SEC << '\n';
    }

A very nice and thorough answer! Out of curiosity, regarding "maybe even get a small speed boost" -- can we find a single case (using say Godbolt) to show that adding const actually allows the compiler to do something smarter? I learned C++ around 15 years ago and this claim was often repeated back then at least, but I wonder if it's true at all. — Juho
– Juho, Commented Oct 18, 2019 at 6:57
O_O impressive. I have just a question. How do I change the class Point2D to accept 2 vectors? I used random values as an example in the code. I have real position in my work. — Fabrizio
– Fabrizio, Commented Oct 18, 2019 at 7:08
@Juho Whether const confers a performance advantage in this context is, of course, entirely dependent on the compiler used. For the most used compilers for desktop machines (clang, gcc, MSVC) there is likely no difference. For some embedded systems compilers I have used, I’ve observed such a difference. Measuring, as you suggest, is the way to find out what difference it makes on your particular platform, configuration and compiler. — Edward
– Edward, Commented Oct 18, 2019 at 7:50
@Fabrizio I would probably not make a change to the Point2D class, but rather create a function that converts two vectors of numbers into a single vector of Point2D. — Edward
– Edward, Commented Oct 18, 2019 at 7:57
@Edward I wasn't limiting ourselves to this specific case, but I'm asking whether there is any case where we can see a difference in the generated code depending on whether const is used or not. — Juho
– Juho, Commented Oct 18, 2019 at 7:59
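The conversion suggested in the comments, from two coordinate vectors to one vector of points, might look roughly like the sketch below. The `Point2D` here is a simplified stand-in with public members, and `to_points` is a hypothetical name; truncating to the shorter input is an assumption of this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified stand-in for the answer's Point2D, with public members for brevity.
template <typename T>
struct Point2D {
    T x, y;
};

// Hypothetical helper: zips two coordinate vectors into one vector of points.
// If the lengths differ, the surplus coordinates are ignored (an assumption).
template <typename T>
std::vector<Point2D<T>> to_points(const std::vector<T>& xs, const std::vector<T>& ys)
{
    const std::size_t n = std::min(xs.size(), ys.size());
    std::vector<Point2D<T>> pts;
    pts.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        pts.push_back(Point2D<T>{xs[i], ys[i]});
    return pts;
}
```

With real data, `to_points(x1, y1)` and `to_points(x2, y2)` would then replace the randomly filled vectors.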

Juho · Accepted Answer · 2019-10-16 15:51:03Z

Some general comments:

In terms of readability and protection from errors, you could benefit a lot from something like struct point which encapsulates a 2-dimensional point. That is, if x and y belong together logically, they should be inside the same structure.
Read about const correctness. All parameters that are not modified should be marked as const, again for readability and protection from errors.
All max... and min... variables should be const as well.
You are doing many unnecessary passes over your vectors with max_element and min_element, but what you actually want here is minmax_element.
Is temp2 really necessary? Why not push into res directly?
Currently, you also need to include <iterator>. Note that in C++, we also have <ctime>.
For readability and potentially performance as well, don't declare all variables at the beginning of a scope. That is, box_... vectors and res can be declared much later, closer to their site of usage.
You can precompute cutoff * 1.1 instead of writing the expression each time you need the result.
I don't know if you did already, but Boost.Geometry could be helpful in cleaning up the code.
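As a small sketch of the `minmax_element` suggestion, the helper below finds the combined range of two coordinate vectors with one pass over each. The name `combined_range` is made up for this example, and both inputs are assumed to be non-empty.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: {smallest, largest} value across two non-empty vectors,
// using a single std::minmax_element pass per vector.
std::pair<double, double> combined_range(const std::vector<double>& a,
                                         const std::vector<double>& b)
{
    const auto ra = std::minmax_element(a.begin(), a.end());
    const auto rb = std::minmax_element(b.begin(), b.end());
    return { std::min(*ra.first, *rb.first), std::max(*ra.second, *rb.second) };
}
```

In `dist`, `combined_range(x1, x2)` and `combined_range(y1, y2)` would give the bounds for both axes, and a single `const double cell = 1.1 * cutoff;` could replace the repeated expression.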

Thank you for the suggestions. I will apply them. I have also cast a positive vote, but my reputation is too low for now. I also noticed many other style problems that I have to fix. Regarding temp2, it is possible that two particles in neighboring boxes are farther apart than cutoff. For this reason I have to be sure to remove these results. I am not sure if there is another, more direct way to select all the values of an array below a certain threshold and "move" them to the result vector. — Fabrizio
– Fabrizio, Commented Oct 17, 2019 at 8:49
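On the question of selecting the values below a threshold directly: one option, sketched here under the assumption that the distances are already in a vector, is `std::copy_if` with a `std::back_inserter`, which appends the matching values without an intermediate `temp2`. The function name `append_below` is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Appends every value of `src` that is below `cutoff` to the end of `dst`.
void append_below(const std::vector<double>& src, double cutoff, std::vector<double>& dst)
{
    std::copy_if(src.begin(), src.end(), std::back_inserter(dst),
                 [cutoff](double d) { return d < cutoff; });
}
```

In the question's loop this would be called as `append_below(temp, cutoff, res);`.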

ALX23z · Accepted Answer · 2019-10-17 10:58:08Z

Basics:

Use a point2d struct to save the coordinates. It is better for the code: it is more cache friendly, you can add methods to the structure that document what the operations do, and you automatically enforce that the x and y coordinate vectors have matching sizes.
You don't need to return the whole vector of distances, just compute the average and return it.

To improve performance: the box idea is sufficient but implementation leaves a lot to be desired.

The size of the box doesn't necessarily need to be cutoff or cutoff*1.1. You might want it to be different depending on the density. For the low density case, consider a box resolution of cutoff/sqrt(2), so that points belonging to the same box are necessarily within the cutoff distance (the box diagonal equals the cutoff) and you can skip the fine-distance filtration in this case. For the high density case you can make the box small enough to contain just several points, so that the coarse box-based filtration is more accurate.
There is little need to arrange both sides into boxes, it is enough to arrange just one side into boxes and for each point from the other side simply find relevant boxes and then apply the fine cutoff filtration. Also now you can consider making the comparison not (box1 vs box2) but rather (point1 vs range_of_boxes2) for each row of boxes. It should be more cache friendly.
In the case when you have high point density compared to cutoff size, consider switching to "average square distance" from "average distance", as in this case it is much easier and faster (O(n) vs O(n^2)) to compute average distances points1 vs points2 (as long as all points are within the cutoff distance) - also it allows certain precomputation optimizations that are impossible with "average distance". If this is relevant I can elaborate.

Fabrizio · Accepted Answer · 2019-10-23 14:54:52Z

The code contains a major logical flaw.

I tried to correct my question but @Mast made me notice that this is against the rules, sorry for my mistake.

In the section of the code that uses the boxes there is a for loop to generate the lists points_x1c and points_y1c. In that section there is a wrong "if" statement.

    for (int i = 0; i < max_box_x; ++i) {
        for (int j = 0; j < max_box_y; ++j) {
            vector <double> points_x1c, points_y1c, points_x2c, points_y2c;
            for (int k = 0; k < box_x1.size(); ++k) {
                if ((box_x1[k] == i || box_x1[k] == i + 1 || box_x1[k] == i - 1) &&
                    (box_y1[k] == j || box_y1[k] == j + 1 || box_y1[k] == j - 1)) {
                    points_x1c.push_back(x1[k]);
                    points_y1c.push_back(y1[k]);
                }
            }

The correct code should be the following:

    for (int i = 0; i < max_box_x; ++i) {
        for (int j = 0; j < max_box_y; ++j) {
            vector <double> points_x1c, points_y1c, points_x2c, points_y2c;
            for (int k = 0; k < box_x1.size(); ++k) {
                if (box_x1[k] == i && box_y1[k] == j) {
                    points_x1c.push_back(x1[k]);
                    points_y1c.push_back(y1[k]);
                }
            }

This is because the lists points_x1c and points_y1c (logically, the points of set 1 belonging to a specific box) must contain only the values for the box "i,j"; only the points of set 2 are taken from the neighboring boxes as well. Obviously, all the suggestions given in the other answers still apply: the code can be better written and standardized by following those guidelines.
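A compact sketch of the same idea, with only one set binned as another answer suggests, is given below. It is not the code from the question: the point type `P`, the hash-map grid, the cell side equal to the cutoff, and the names `cell_key` and `mean_dist_grid` are all choices made for this illustration. Each point of set 1 looks only at its own cell and the eight surrounding ones, so every qualifying pair is visited exactly once.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct P { double x, y; };  // hypothetical point type for this sketch

// Packs two signed cell indices into one 64-bit key.
std::uint64_t cell_key(std::int32_t cx, std::int32_t cy)
{
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32)
         | static_cast<std::uint32_t>(cy);
}

// Mean distance over all pairs (a in A, b in B) with |a - b| < cutoff.
// Returns 0.0 when no pair qualifies (an assumption of this sketch).
double mean_dist_grid(const std::vector<P>& A, const std::vector<P>& B, double cutoff)
{
    // Cell side equals the cutoff, so points closer than the cutoff lie in
    // the same cell or in directly adjacent cells.
    auto cell = [cutoff](double v) {
        return static_cast<std::int32_t>(std::floor(v / cutoff));
    };

    // Bin only set B; each cell stores indices into B.
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    for (std::size_t j = 0; j < B.size(); ++j)
        grid[cell_key(cell(B[j].x), cell(B[j].y))].push_back(j);

    const double cutoff2 = cutoff * cutoff;
    double sum = 0.0;
    std::size_t count = 0;
    for (const P& a : A) {
        const std::int32_t cx = cell(a.x), cy = cell(a.y);
        for (std::int32_t dx = -1; dx <= 1; ++dx) {
            for (std::int32_t dy = -1; dy <= 1; ++dy) {
                const auto it = grid.find(cell_key(cx + dx, cy + dy));
                if (it == grid.end()) continue;   // no B points in this cell
                for (std::size_t j : it->second) {
                    const double ex = a.x - B[j].x, ey = a.y - B[j].y;
                    const double d2 = ex * ex + ey * ey;
                    if (d2 < cutoff2) { sum += std::sqrt(d2); ++count; }
                }
            }
        }
    }
    return count ? sum / static_cast<double>(count) : 0.0;
}
```

Only the running sum and count are kept, so memory use stays proportional to the number of points rather than the number of pairs.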
