Edit 1
This edit is to answer the questions raised in the comments.
The basic idea of delay-and-sum beamforming is to apply delays to the different acquisition channels such that sounds that originate from one point in space align and are "amplified" when the signals from the different channels are added. Sounds that originate from other regions of space do not align and are therefore not "amplified".
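As a rough illustration of the idea, here is a minimal sketch in plain Python. The function names, the whole-sample rounding of delays, and the speed of sound of 343 m/s are assumptions made for this example, not part of any particular library.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def focus_delays(mic_positions, focus, fs):
    """Delay (in whole samples) to apply to each channel so that
    sound coming from `focus` lines up across all channels."""
    dists = [math.dist(m, focus) for m in mic_positions]
    farthest = max(dists)
    # The closest microphone hears the sound first, so it gets the largest delay.
    return [round((farthest - d) / SPEED_OF_SOUND * fs) for d in dists]


def delay_and_sum(channels, delays):
    """Shift each channel later by its delay (in samples) and add them up."""
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        for i in range(n - d):
            out[i + d] += signal[i]
    return out
```

With delays computed for the right focus, a click from that point adds up to roughly the number of microphones times its single-channel amplitude. With the wrong delays the clicks land on different samples and the sum stays small.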
The point in space for which the sounds align using a certain set of delays is called the focus of the microphone array (or focal spot). In reality, however, the focus is not an ideal point but rather a region of space, usually small (how small depends on the array), in which the sounds align well. The size of this region is called the size of the focal spot.
The geometry of the focal spot (size, shape, etc.) depends on the exact details of the array: number of microphones, microphone spacing, and frequency content of the signals of interest. See e.g. this article.
For more information look for texts on focusing "phased arrays" or "linear arrays" in ultrasonics. Beamforming can be used on reception (to amplify signals from a certain point in space) or on emission (to create a "loud" spot in a room). The principles are identical: just replace "microphone" by "loudspeaker" in your thinking.
Regarding the calibration procedure: you are correct. The procedure I outlined is too simplistic. It only works well if you can create the calibration clap from much farther away than the region of space you are interested in (i.e. far enough away to ensure a plane wave).
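The cross-correlation step of the calibration can be sketched as follows. This is a brute-force version in plain Python, meant only to show the principle. The name `best_lag` and the `max_lag` search window are assumptions for this example; for real recordings you would use an FFT-based correlation instead.

```python
def best_lag(reference, signal, max_lag):
    """Lag (in samples) by which `signal` trails `reference`,
    taken from the peak of their cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(signal):
                score += r * signal[j]
        if score > best_score:
            best, best_score = lag, score
    return best
```

A positive result means the clap shows up later in `signal` than in `reference`, a negative result means it shows up earlier. Under the plane-wave assumption above, this lag is what you remove from the channel before beamforming.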
If this is not possible, you have to take the position of the clap into account. In this case, the simplest procedure is to correct the delays by cross-correlation as described, but then add the curvature of the wavefront back onto the signals by applying an "inverse beamforming" set of delays calculated from the position of the origin of the clap. (I.e. if you use a depth variable +t0 (or +z0) in your "normal" beamforming algorithm, you need to use -t0 (or -z0) for the inverse beamforming algorithm.)
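One way to express this correction in code is to subtract the lag expected from the geometry of a nearby clap from the lag measured by cross-correlation. What remains is the start-time offset of each channel. The sketch below assumes that the lags are measured relative to channel 0, that the clap position is known, and that the speed of sound is 343 m/s. The name `start_offsets` is hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def start_offsets(measured_lags, mic_positions, clap_position, fs):
    """Per-channel recording start offset (in samples, relative to channel 0).

    measured_lags: lag of each channel relative to channel 0, in samples,
                   as obtained from cross-correlating the clap recordings.
    """
    dists = [math.dist(m, clap_position) for m in mic_positions]
    # Lag expected purely from the curved wavefront of the nearby clap.
    geometric = [(d - dists[0]) / SPEED_OF_SOUND * fs for d in dists]
    # Whatever the geometry does not explain is attributed to the sound cards.
    return [round(m - g) for m, g in zip(measured_lags, geometric)]
```

Shifting each channel by minus its offset gives the same result as aligning on the clap and then putting the wavefront curvature back.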
The point of this calibration is to eliminate any errors caused by the different sound cards starting their recordings at slightly different times. Such errors would normally prevent the signals from aligning properly even with correct delays, and thus prevent the amplification effect you are looking for.