5

I'm calculating median as:

DECLARE @TEMP TABLE ( ID INT ) Select ( ( Select Top 1 ID From ( Select Top 50 Percent ID From @Temp Where ID Is NOT NULL Order By ID ) As A Order By ID DESC ) + ( Select Top 1 ID From ( Select Top 50 Percent ID From @Temp Where ID Is NOT NULL Order By ID DESC ) As A Order By ID Asc ) ) / 2 

Above query I want to use. But, in my case there are so many columns for those I want to calculate MEDIAN. But I think it would be bad to repeat above block of code for each column. So, I'm trying to define separate function which would accept column values, process and return median. Should I have to define table-value-funtion for that or there is another optimized way to do so??


This question is related to the following questions:

2 Answers 2

11

There are much more efficient ways to calculate a simple or grouped median than the one shown in your question:

What is the fastest way to calculate the median?
Best approaches for grouped median

The general winner for 2012 is a method by Peter Larsson. The pattern is:

Simple Median

SELECT Median = AVG(1.0 * SQ.YourColumn) FROM ( SELECT NumRows = COUNT_BIG(*) FROM dbo.YourTable WHERE ColumnName IS NOT NULL ) AS C CROSS APPLY ( SELECT YT.ColumnName FROM dbo.YourTable AS YT WHERE YT.ColumnName IS NOT NULL ORDER BY YT.ColumnName ASC OFFSET (C.NumRows - 1) / 2 ROWS FETCH NEXT 1 + (1 - C.NumRows % 2) ROWS ONLY ) AS SQ; 

Grouped Median

SELECT SQ2.GroupingColumn, SQ2.Median FROM ( SELECT GroupingColumn, NumRows = COUNT_BIG(*) FROM dbo.YourTable WHERE ColumnName IS NOT NULL GROUP BY GroupingColumn ) AS C CROSS APPLY ( SELECT Median = AVG(1.0 * SQ1.YourColumn) FROM ( SELECT YT.ColumnName FROM dbo.YourTable AS YT WHERE YT.GroupingColumn = C.GroupingColumn AND YT.ColumnName IS NOT NULL ORDER BY YT.ColumnName ASC OFFSET (C.NumRows - 1) / 2 ROWS FETCH NEXT 1 + (1 - C.NumRows % 2) ROWS ONLY ) AS SQ1 ) AS SQ2; 

To maximize the performance of the OFFSET method above, you may need to add a locking hint (advanced topic). Suitable indexing will also be required, of course.

Code reuse

This is hard to achieve directly with a T-SQL function, since these do not allow the execution of dynamic SQL (assuming you were thinking of passing in the column name).

There are several ways around this, including using a function to generate the dynamic SQL text itself, which can then be executed by the caller. There aren't really enough details in the question to say which approach would be most suitable for you.

0
5

If you want to calculate a Median value, there are a few ways to do this depending on what version of SQL Server you (or someone else, perhaps) are on. Dwain Camps wrote two articles for Simple Talk in which he collected several pure T-SQL options from various folks, gives examples of each and compares their performance:

However, I don't think you can encapsulate any of those methods into a function for easy re-use. For that you would need to use SQLCLR in order to create a User-Defined Aggregate (UDA). I wrote an article several years ago showing how to create a UDA, using Median as the example :-)

Getting The Most Out of SQL Server 2005 UDTs and UDAs

Regarding that article, please keep in mind:

  • The article was written in 2007, before SQL Server 2008 came out, which introduced an improvement (i.e. allowing for storing more than 8000 bytes) that renders the compression technique shown in the article to be somewhat unnecessary. But, it should be fairly easy to remove the GZipStream stuff and change MaxByteSize from 8000 to -1.
  • You don't need to code anything since the article includes, at the bottom, a SQL script that installs the UDA so you can just download, run, and use the Median aggregate :-)

See also:

2
  • 1
    While generally a fan of SQLCLR, it's tough to recommend a UDA for this, given the lack of support for IsInvariantToOrder, and the consequent impact on memory use and performance compared with the best T-SQL methods. The basic implementation is extremely limited in maximum input size, meaning the query could fail unexpectedly. Extending it to LOB size as suggested, is not without significant implications either. Commented Jan 26, 2016 at 7:41
  • @PaulWhite Fair enough with regards to larger data sets. I could add a caveat regarding context and include your comment. But MaxByteSize = -1 doesn't mean someone will be using >= 1 MB; it could just as easily be 9 - 10k. Though I have not tested the impact of setting to -1 yet still only using < 8k. Commented Jan 26, 2016 at 7:51

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.