$\begingroup$

What is the difference between blocking and clustering?

As far as I know, clustering originates from machine learning and refers to methods that group data points in an "us and them" way: points in the same cluster are related, points in different clusters are not.

Blocking, however, seems to come from statistics and was picked up later in data analysis. I encountered blocking (also called indexing) primarily in record linkage, where it is used to partition data into blocks before they are analyzed.

One could even use both blocking and clustering methods in record linkage. However, I don't understand their difference well enough yet and would like a different perspective on the matter.

My interpretation thus far: The goal of blocking and clustering is different:

Blocking is used to throw out potentially unnecessary data, distinguishing between important and unimportant data.

Clustering, however, does not make that distinction: all data points are important. It is used for classification purposes, that is, assigning every data point to a group.

$\endgroup$

1 Answer

$\begingroup$

As you mentioned, their goals are different. In clustering, we try to group data so that the points within a group are similar to each other, that is, they show little variability within the group. For example, when clustering a company's customers, the members of each cluster have similar buying behavior.
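To make the customer example concrete, here is a minimal sketch of clustering using a plain one-dimensional k-means. The spending figures, the function name `kmeans_1d`, and the starting centroids are all made up for illustration; k-means is just one of many clustering methods and is not implied by anything above.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small 1-D k-means: assign each value to its nearest centroid,
    then move each centroid to the mean of its assigned values."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters, centroids


# Hypothetical monthly spend per customer
spend = [12, 15, 14, 90, 95, 88, 13, 92]

# Starting centroids are an arbitrary assumption
clusters, centroids = kmeans_1d(spend, centroids=[0.0, 100.0])

for center, members in zip(centroids, clusters):
    print(f"centroid {center}: {sorted(members)}")
```

Every customer ends up in exactly one group, and no data point is discarded; the groups themselves are the result of interest.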

On the other hand, in blocking we try to reduce the variability, for instance in record linkage, where the records are split into blocks before they are compared.
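As a rough illustration of blocking in record linkage, the sketch below partitions records by a blocking key and only forms candidate pairs within each block. The records, the helper names, and the choice of key (first letter of the surname) are assumptions made for this example, not a recommended blocking scheme.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, surname, city)
records = [
    (1, "Smith", "Boston"),
    (2, "Smyth", "Boston"),
    (3, "Smith", "Denver"),
    (4, "Jones", "Denver"),
    (5, "Jonas", "Denver"),
    (6, "Brown", "Austin"),
]


def blocking_key(record):
    # Assumed key: first letter of the surname
    return record[1][0].upper()


def candidate_pairs(records, key):
    """Group record ids by blocking key and pair ids only within a block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record[0])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Without blocking, every record is compared with every other record
all_pairs = set(combinations(sorted(r[0] for r in records), 2))
blocked_pairs = candidate_pairs(records, blocking_key)

print(len(all_pairs), "pairs without blocking")
print(len(blocked_pairs), "pairs with blocking:", sorted(blocked_pairs))
```

In this sketch the blocks are only a means to an end: no record is thrown away, but comparisons between records in different blocks are skipped. The detailed matching step would then run on the remaining candidate pairs.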

$\endgroup$
