6
\$\begingroup\$

I am trying to apply this simple macro to merge two big sized sheets (around 30000 rows each), but the process is too slow and never ends. The macro works perfectly with smaller sheets. Could you give me any advice for the optimization of my code please ?

My Macro consists in an Userform :

Public listChoice As String 'Using your code to get the sheet names for the ListBox rowsource. Private Sub UserForm_Activate() For n = 1 To ActiveWorkbook.Sheets.Count With SelectSheet .AddItem ActiveWorkbook.Sheets(n).Name End With Next n End Sub 'Including an update event for the ListBox Private Sub SelectSheet_AfterUpdate() listChoice = SelectSheet.Text End Sub 'Including a test just to demonstrate that the result is still retained. You don't need this, it demonstrates the results on the screenshot. Private Sub CommandButton1_Click() Dim sh1 As Worksheet Dim sh2 As Worksheet Dim sh3 As Worksheet Dim lc As String Dim letter2 As String Dim letter3 As String Dim mrgKeyRange1 As Range Dim mrgKeyRange2 As Range Dim cell As Range Dim lastC1 As Integer Dim lastC2 As Integer Dim lastC3 As Integer Dim lrow As Integer Dim currentR As Integer Dim key As Variant lc = listChoice 'closing the UserForm Unload Me Set mrgKeyRange1 = Application.InputBox("Select the range by wich the rows of the current sheet will be merged with the sheet " & lc, Type:=8) 'type 8 serve a fargli pigliare un range Set mrgKeyRange2 = Application.InputBox("Select the corresponding range in " & lc, Type:=8) Application.ScreenUpdating = False Application.EnableEvents = False Application.Calculation = xlCalculationManual 'if there are many calculations, it helps to speed the macro up Set sh1 = ActiveSheet Set sh2 = ActiveWorkbook.Sheets(lc) Set sh3 = Sheets.Add 'renaming the new sheet If Len(sh1.Name) < 26 Then 'the limit of a sheet's title is 31 chars sh3.Name = "Merged" & sh1.Name Else sh3.Name = "MergedSheet" End If 'adding the headers to the new sheet sh1.Rows(1).Copy Destination:=sh3.Rows(1) lastC1 = LastColumn(sh1) lastC2 = LastColumn(sh2) 'LastCol() is defined in the module LastRowColumn lastC3 = LastColumn(sh3) + 1 letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3) 'formatting the headers of the new sheet (sh3) With sh3 lastC3 = LastColumn(sh3) letter3 = NumberToLetter(lastC3) .Cells(1, 1).Copy .Range("B1:" & letter3 & "1").PasteSpecial Paste:=xlPasteFormats .Range("A1:" & letter3 & "1").Columns.AutoFit End With '>>CR note: the macro begins to be veeeeeery slow starting from here: 'For each value in the Merging Key range, it finds the corresponding row in the other sheet For Each key In mrgKeyRange1 If Trim(key) <> "" Then With mrgKeyRange2 Set cell = .Find(What:=key, _ LookIn:=xlValues, _ LookAt:=xlWhole, _ SearchDirection:=xlNext, _ MatchCase:=False) If Not cell Is Nothing Then lrow = LastRow(sh3) key.EntireRow.Copy sh3.Rows(lrow + 1).PasteSpecial xlPasteValues currentR = cell.row sh2.Range("A" & currentR & ":" & letter2 & currentR).Copy sh3.Cells(lrow + 1, lastC1 + 1).PasteSpecial xlPasteValues End If End With End If Next Application.ScreenUpdating = True Application.EnableEvents = True Application.Calculation = xlCalculationAutomatic End Sub 

In order to call the userform, I use this very simple code, written in another module:

Sub MergeSheets() Merge_UserForm.Show vbModeless End Sub 

Here is an example of two sheets that I want to Merge by ID. Normally my sheets contain much more columns and the ID is variable (sometimes are numbers, sometimes strings,...)

enter image description here enter image description here

\$\endgroup\$
11
  • 1
    \$\begingroup\$ What is the logical basis for merging the data? Is it alphabetical sorting? Weeding out (deleting/skipping) duplicate data rows? What constitutes a unique data row (what combination of columns indicates unique-ness and/or sorting)? \$\endgroup\$ Commented Sep 1, 2016 at 13:30
  • \$\begingroup\$ If two sheets have common keywords in two of the ranges chosen by the user (mrgKeyRange1, mrgKeyRange2) their respective rows conteining the same keyword are pasted side by side in a third sheet (sh3). There is nothing constituting data row's uniqueness, all matches produces a copy paste towards the new sheet. \$\endgroup\$ Commented Sep 1, 2016 at 13:45
  • \$\begingroup\$ Can you provide any sample data of your keywords and data sheets? \$\endgroup\$ Commented Sep 1, 2016 at 13:53
  • \$\begingroup\$ Yes, how can I attach it ? \$\endgroup\$ Commented Sep 1, 2016 at 14:45
  • \$\begingroup\$ @LorenzoLP you can edit your post and upload an image with Ctrl+G (or by clicking the image icon in the edit toolbar) \$\endgroup\$ Commented Sep 1, 2016 at 14:47

2 Answers 2

3
\$\begingroup\$

@Mat'sMug does an excellent job with covering the OOP aspects of the code, so I'll just dive straight into the performance issues.

1 - Just a tiny drop in the performance bucket, but you call lastC3 = LastColumn(sh3) twice in this section:

lastC3 = LastColumn(sh3) + 1 letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3) 'formatting the headers of the new sheet (sh3) With sh3 lastC3 = LastColumn(sh3) 

You could omit the second call by simply calculating the new last column instead:

lastC3 = LastColumn(sh3) + 1 letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3) 'formatting the headers of the new sheet (sh3) With sh3 lastC3 = lastC3 + lastC2 

However, if you glance up about 13 lines of code you'll notice that LastColumn(sh3) will always be the same as lastC1 at that point because you've just added it on the line Set sh3 = Sheets.Add and its only contents are the sh1 headers. Best would be to always use the calculated values when you can.

2 - The NumberToLetter should never be used unless you're displaying a human readable column address (and in that case you should let Excel do it for you). Computers work with numbers well, but strings, not so much. What happens with code like this...

letter2 = NumberToLetter(lastC2) 'NumberToLetter is described in NumToLet module sh2.Range("A1:" & letter2 & "1").Copy Destination:=sh3.Cells(1, lastC3 + 1) 

...is that you take your last column number (let's say for the sake of argument that it's 4 like your top data example). You make a function call to convert it to "D", then concatenate it into a String with "A1:" & "D" & "1", and pass the value to sh.Range as "A1:D1". And then... Excel parses the string. It determines that "A" refers to column 1, and "D" refers to column 4. You're doing a bunch of work with the sole effect of making Excel do more work. Do Excel a solid and use the numeric interfaces:

With sh2 .Range(.Cells(1, 1), .Cells(lastC2, 1)).Copy Destination:=sh3.Cells(1, lastC3 + 1) End With 

3 - Don't use .Copy. At all. Ever. It actually does 2 copies internally - one to the clipboard and one to the destination. Then comes the really annoying UX part - it wipes out whatever the user already had in the clipboard. Ctrl-V didn't do anything? Oh. Somebody used .Copy in a macro. Too bad I closed that file/navigated away from that page/deselected the 100 unordered items I laboriously Ctrl-clicked... You get the point. Instead, just assign the values directly:

sh3.Range(sh3.Cells(1, lastC3 + 1),sh3.Cells(1, lastC3 + lastC2 + 1)).Value = _ sh2.Range(sh2.Cells(1, 1), sh2.Cells(lastC2, 1)).Value 

It's much faster, and doesn't blow away the clipboard.

4 - While I have UX on the brain, you shouldn't make the assumption that the user has Application.Calculation set to xlCalculationAutomatic. This is an application wide setting, and it holds its state after the your code completes. Instead, cache the current setting at the start of the code, do what you need to, and then change it back to the setting it was when you found it. Also, if you're going to alter the state of the application, you should have error handling to make sure that it gets set back if you have an exception somewhere:

 On Error GoTo CleanExit Dim calcState As XlCalculation Dim eventState As Boolean calcState = Application.Calculation Application.Calculation = xlCalculationManual eventState = Application.EnableEvents Application.EnableEvents = False '... CleanExit: Application.Calculation = calcState Application.EnableEvents = eventState End Sub 

5 - Only copy data that needs to be copied. .EntireRow is well named - it's every single cell in the row. All 16,384 of them. You already know how wide the data is because you measure it here:

lastC1 = LastColumn(sh1) lastC2 = LastColumn(sh2) lastC3 = LastColumn(sh3) 

Just figure out which sheet mrgKeyRange1 belongs to (hint, it's mrgKeyRange1.Parent) and only copy the appropriate number of columns.

6 - Similar to #1, but with a much, much higher impact on your overall performance is the call lrow = LastRow(sh3) inside your main loop. This value will never increase by more than 1. You enter the loop with only header rows in sh3, and you're adding rows to sh3 one at a time. It can be replaced with this...

lrow = 1 For Each key In mrgKeyRange1 If Trim(key) <> "" Then With mrgKeyRange2 '... If Not cell Is Nothing Then lrow = lrow + 1 key.EntireRow.Copy sh3.Rows(lrow).PasteSpecial xlPasteValues 

7 - Pausing for a brief nitpick, the If block beginning with If Not cell Is Nothing Then should be indented another level.

8 - Avoid round trips to the Worksheets, especially in the loops. You can pull all of the values from a range of cells as an array like this:

Dim allValues As Variant allValues = ActiveSheet.Range("A1:D1000").Value 'allValues is now a 2D array containing all the cell values 'in "A1:D1000". 

Reading individual values from a worksheet is an expensive operation. Reading values from an array is really, really cheap. This is where the rubber really hits the road. You'll see an immediate performance gain by restructuring your main loop like this:

Dim searchItems As Variant searchItems = mrgKeyRange1.Value Dim searchRow As Long Dim searchCol As Long For searchRow = LBound(searchItems, 1) To UBound(searchItems, 1) For searchCol = LBound(searchItems, 2) To UBound(searchItems, 2) key = searchItems(searchRow, searchCol) '... Next Next 

9 - I don't know enough about the data to recommend a specific alternative approach, but my guess is that the .Find call is another big source of burned CPU cycles. If you're only checking one column in SheetX against one column in SheetY and at least one contains only unique values, it would probably be fastest to just sort both of them on the target columns and use "seeking" indexes to ladder your way down both sets of data at the same time. Pseudo-code:

Sort both lists. Do until index1 hits the end of array1 Do until array2(index2) = array1(index1) or index2 hits the end of array2 Increment index2 Loop If index2 is at the end of array2, Exit. Process the match. Do until array1(index1) = array2(index2) or index1 hits the end of array1 Increment index1 Loop Loop 

Otherwise it would probably be faster to read all of the mrgKeyRange2 values into a Dictionary and use its hash lookups instead of doing individual searches. Store the "key" as the key and the row number (or collection of row numbers if you have duplicates) as the value. Then you can just test to see if they're in the Dictionary and retrieve the row number(s) as you iterate through mrgKeyRange1. There's an example of something similar to this method here.

\$\endgroup\$
4
  • \$\begingroup\$ Dear Comintern, thank you very much for your very detailed comment, you are a great teacher =) My code runs much faster now, but I still have some problems when my two ranges to match exceed the 30000 lines of values. In fact, the macro arrives to the end after few seconds but without providing the user only with a new sheet with headers. In other terms, when the sheets size is veeeery big, the result of the "match-copy-paste loop" is nothing. Do you have any explanation in mind ? \$\endgroup\$ Commented Sep 12, 2016 at 16:00
  • \$\begingroup\$ I found the bug ! A function I took from cpearson.com/excel/vbaarrays.htm (NumberOfArrayDimensions) didn't recognize big 2 dimensional arrays as 2 dimensional if their size > 30000 for one of their dimensions. This created an incorrect chain of events which resulted in...nothing ! Now the problem is solved and I emailed the guy (Chip Pearson) for explanations. Thanks again for the support :) \$\endgroup\$ Commented Sep 14, 2016 at 13:50
  • \$\begingroup\$ @Lollo - Wow, that's a horribly naïve implementation of NumberOfArrayDimensions. See the ReadSafeArrayInfo implementation in this answer for a reliable way to get the number of dimensions without using the error handler. It will be sitting in the cDim member of the resulting struct. BTW, the error in the Chip Pearson implementation is that he's overflowing Res if the UBound won't fit into an Integer. \$\endgroup\$ Commented Sep 14, 2016 at 14:00
  • \$\begingroup\$ Yeah I just spotted it ! Anyway, thanks for the link, it seems to be a much more solid solution. \$\endgroup\$ Commented Sep 14, 2016 at 14:21
5
\$\begingroup\$

The first thing I expect to see in any code module - be it a standard module, a class module, a worksheet module, ThisWorkbook, or a UserForm's code-behind, is this:

Option Explicit 

With this, your code wouldn't compile... because you're not declaring every variable that you're using - in the UserForm_Activate handler, n is an implicit Variant that VBA allocates on the fly... and Variant should be avoided for routine tasks such as looping:

Private Sub UserForm_Activate() For n = 1 To ActiveWorkbook.Sheets.Count With SelectSheet .AddItem ActiveWorkbook.Sheets(n).Name End With Next n End Sub 

There are a few things to improve here. A For Each loop performs best when iterating an array; when iterating a collection of objects (such as the Sheets collection of the ActiveWorkbook), it's best to use a For Each loop:

Dim sheet As Worksheet For Each sheet In ActiveWorkbook.Worksheets SelectSheet.AddItem sheet.Name Next 

Notice this loop is iterating the Worksheets collection - the Sheets collection contains sheets that aren't worksheets, such as chart sheets.

A new sheet reference is "captured" at each iteration, so you don't need to access the Sheets or Worksheets collection every time - it's simply given to you by the iteration mechanism; that's why a For Each loop performs better with collections. Keep For...Next loops for iterating arrays.


Public listChoice As String 

Given that a UserForm is really a class module with a designer and a default instance, it should be used as an object - and in object-oriented code, this listChoice module-level variable is a public field.

A public field makes the String value readable from the calling code. The problem is that is also makes the value writable from the calling code... which doesn't always make sense and makes it easier to introduce bugs.

I like that you've abstracted away the ListBox, so the caller doesn't need to know that the selection is coming from a specific control. A better and more object-oriented way to do exactly that is to expose a property:

Public Property Get SelectedSheetName() As String With SelectSheet If .ListIndex = -1 Then Exit Property 'nothing is selected SelectedSheetName = .List(.ListIndex) End With End Property 

Notice that this eliminates the need for a public field for the client code to access the selected sheet name.

Now, looking at the rest of the code, it seems the listChoice field / SelectedSheetName property can very well be Private, given it's only used in the module it's declared in: variables and members should always have the tightest possible scope - I like that all members are Private, but that Public field makes this code possible:

Merge_UserForm.listChoice = "potato" Merge_UserForm.Show vbModeless 

And then the user could click the CommandButton1 without making a valid selection in the SelectSheet listbox, and then this line would blow up:

Set sh2 = ActiveWorkbook.Sheets(lc) 

Side note, what's the lc local variable needed for? listChoice is already there, in scope, waiting to be used... but I'll get back to this in a moment.

I think the form, CommandButton1 in particular, is responsible for way too many things. A UserForm is a view, a user interface: a UI exists because a program needs to collect user input.

I think these InputBox calls are a missed opportunity to have two RefEdit controls on that form, to collect Range selections from the user without ugly and annoying InputBox prompts - not to mention that Excel.Application.InputBox is a bit confusing when there's also the standard VBA.Interaction.InputBox. And you're not validating that the selected range is actually on the lc sheet, which can lead to some interesting bugs.

Why do you need that ListBox at all anyway then? Let the user select a range, and extract the sheet's name from that range!

So back the form's responsibilities: collecting user input. What are the actual inputs you need? mrgKeyRange1 and mrgKeyRange2? I'm not sure I completely understand exactly what's happening (haven't looked at the actual "do work" code yet), but it seems to me your UI could look something like this:

a simple form with two RefEdit controls and an Ok button

Then there would be some logic to validate the selected ranges and make sure the form can't be OK'd without consistent input values (e.g. if the selected columns/rows need to line up, or if the two ranges must be on separate sheets, etc.) - the entire role and purpose of a UserForm is to collect and validate the user's input.

So the only code that should be in a form's code-behind, is simple code that deals with simple mechanics, e.g. making sure that the object instance doesn't get destroyed when the user decides to X-out and cancel everything:

Private cancelled As Boolean Public Property Get IsCancelled() As Boolean IsCancelled = cancelled End Property Public Property Get Selection1() As Range 'todo: rename On Error Resume Next Set Selection1 = Application.Range(RefEdit1.Value) On Error GoTo 0 End Property Public Property Get Selection2() As Range 'todo: rename On Error Resume Next Set Selection2 = Application.Range(RefEdit2.Value) On Error GoTo 0 End Property Private Sub OkButton_Click() cancelled = False Me.Hide End Sub Private Sub UserForm_QueryClose(Cancel As Integer, CloseMode As Integer) If CloseMode = VbQueryClose.vbFormControlMenu Then cancelled = True Cancel = True Me.Hide End Sub 

If the input ranges can be invalid and you would want to prevent the user from OK'ing the form with such invalid input, I would suggest to keep that responsibility outside the form still - by implementing it in dedicated class modules.

I don't know what the rules are, and I don't know how many there are - so I'll go and add a new class module and call it IRangeValidationRule to define a simple interface:

Option Explicit Public Function Validate(ByVal range1 As Range, ByVal range2 As Range) As Boolean End Function 

And then say one of the rules is that range1 and range2 must refer to distinct sheets - we would have another class module, DistinctSheetRangeValidationRule:

Option Explicit Implements IRangeValidationRule Private Function IRangeValidationRule_Validate(ByVal range1 As Range, ByVal range2 As Range) As Boolean On Error GoTo CleanFail Dim result As Boolean result = Not range1.Parent Is range2.Parent CleanExit: IRangeValidationRule_Validate = result Exit Function CleanFail: result = False Resume CleanExit End Function 

And so on for each validation rule you might have. Then the form could have this:

Private rules As New Collection Public Sub AddValidationRule(ByVal rule As IRangeValidationRule) rules.Add rule End Sub 

And you could then determine whether or not the selected ranges are valid by simply iterating the rules:

Private Function IsValid() As Boolean Dim rule As IRangeValidationRule For Each rule In rules If Not rule.Validate(Selection1, Selection2) Then IsValid = False Exit Function End If Next IsValid = True End Function 

And then you could handle the two RefEdit controls' AfterUpdate handlers to run the validation and disable the OkButton until the input is valid:

Private Sub RefEdit1_AfterUpdate() OkButton.Enabled = IsValid End Sub Private Sub RefEdit2_AfterUpdate() OkButton.Enabled = IsValid End Sub 

So, what would the calling code look like then? This couldn't work:

Sub MergeSheets() Merge_UserForm.Show vbModeless End Sub 

First, it's working off the default instance of the form, and then with the actual "do work" code gone, that wouldn't be doing anything. We need to create an instance and work with that - but first let's remove that underscore and keep the form's name PascalCase; the underscore makes it look like some event handler or interface member implementation procedure. Then we'll restrict user interactions to entering valid input or cancelling the form by using vbModal, so while the form is displayed, the user can't interact with anything other than the form:

Public Sub MergeSheets() With New MergeUserForm .Show vbModal If Not .IsCancelled Then DoWork .Selection1, .Selection2 'todo: rename all these End With End Sub 

Notice the conceptual difference here: instead of fire-and-forget displaying a form and not knowing what's happening afterwards, we can see that we're displaying a form, collecting Selection1 and Selection2, allowing the user to cancel everything, and passing the inputs to some DoWork procedure that's responsible for the actual work - the form itself doesn't do much.

If we have implemented validation rules as I've shown above, we could have this:

Public Sub MergeSheets() With New MergeUserForm .AddValidationRule New DistinctSheetValidationRule .AddValidationRule New SomeOtherValidationRule .Show vbModal If Not .IsCancelled Then DoWork .Selection1, .Selection2 'todo: rename all these End With End Sub 

The biggest advantage with this approach, is that you've decoupled the actual work (and input validation rules) from the UI, and you can even write unit tests for them if you want - and because the DoWork procedure is taking the user input as parameters rather than prompting for it, you can write unit tests for it too!

I'll end this answer here and let other reviewers address the actual do work procedure and its performance problems.

\$\endgroup\$
12
  • 2
    \$\begingroup\$ Good answer! For OP - Always turn on Option Explicit. You can have it automatically by going to Tools -> Options in the VBE and checking the Require Variable Declaration option. \$\endgroup\$ Commented Sep 1, 2016 at 19:21
  • \$\begingroup\$ Dear Mat's Mug, thank you for your detailed answer, you made me learning a huge amount of new vba features, and programming in general :) Unfortunatly, however, the code doesn't run because of this error : "user-defined type not defined", that is apparently related to this Public Sub AddValidationRule(ByVal rule As IRangeValidationRule) rules.Add rule End Sub....any idea? \$\endgroup\$ Commented Sep 7, 2016 at 12:24
  • \$\begingroup\$ @Lollo there needs to be a class module named IRangeValidationRule for that code to be compilable - I've edited my answer to clarify =) \$\endgroup\$ Commented Sep 7, 2016 at 13:15
  • \$\begingroup\$ I already did it before commenting ! :( \$\endgroup\$ Commented Sep 7, 2016 at 13:59
  • \$\begingroup\$ Then it's probably just a typo - the class name should be offered by IntelliSense after entering the whitespace following As \$\endgroup\$ Commented Sep 7, 2016 at 14:37

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.