Variable %0 (ctr in the surrounding function) is counting the number of bits in the byte being transmitted. The branch causes the routine to loop until all 8 bits of %1 (curbyte) have been processed.
It looks like you may have removed more than just no-ops. There should be at least one branch based on the value of a bit in %1 that selects one of two paths through the body of the loop. The out instructions toggle an output pin high and low to create the PWM waveform that the WS2812 requires, and the no-ops control the relative timing of the high and low periods.
Ah, yes. Looking at the original code, it has:
asm volatile(
    "       ldi   %0,8  \n\t"
    "loop%=:            \n\t"
    "       out   %2,%3 \n\t"    //  '1' [01] '0' [01] - re
    [ variable number of NOPs based on w1_nops ]
    "       sbrs  %1,7  \n\t"    //  '1' [03] '0' [02]
    "       out   %2,%4 \n\t"    //  '1' [--] '0' [03] - fe-low
    "       lsl   %1    \n\t"    //  '1' [04] '0' [04]
    [ variable number of NOPs based on w2_nops ]
    "       out   %2,%4 \n\t"    //  '1' [+1] '0' [+1] - fe-high
    [ variable number of NOPs based on w3_nops ]
    "       dec   %0    \n\t"    //  '1' [+2] '0' [+2]
    "       brne  loop%=\n\t"    //  '1' [+3] '0' [+4]
    :   "=&d" (ctr)
    :   "r" (curbyte), "I" (_SFR_IO_ADDR(ws2812_PORTREG)), "r" (maskhi), "r" (masklo)
);
The pin gets set high at the beginning of each pass through the loop. The sbrs instruction (Skip if Bit in Register is Set) tests bit 7 of curbyte: if the bit is clear, the next out executes and the pin goes low right after the w1_nops period; if the bit is set, that out is skipped and the pin stays high until the out that follows the w2_nops period. The total of w1_nops + w2_nops + w3_nops (plus the instructions shown here) determines the total period of each bit.
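
To make the timing concrete, here is a rough sketch in plain C (meant to run on a PC, not on the AVR) of the arithmetic implied by the bracketed cycle counts in the comments. The function names and the constants FIXED_HIGH_0, FIXED_HIGH_1 and FIXED_TOTAL are my own, not from the original code. The constants assume 1-cycle out/lsl/dec, a 2-cycle taken brne, and sbrs costing 1 cycle when it falls through and 2 when it skips.

```c
/* Fixed cycle costs read off the asm comments (hypothetical names).
   A '0' bit goes low at cycle 3 + w1, a '1' bit at cycle 5 + w1 + w2,
   and the pin went high at the end of cycle 1 in both cases. */
enum { FIXED_HIGH_0 = 2, FIXED_HIGH_1 = 4, FIXED_TOTAL = 8 };

/* Number of CPU cycles the pin stays high for one transmitted bit. */
static unsigned high_cycles(int bit, unsigned w1_nops, unsigned w2_nops)
{
    return bit ? FIXED_HIGH_1 + w1_nops + w2_nops
               : FIXED_HIGH_0 + w1_nops;
}

/* Total CPU cycles per bit (one full pass through the loop). */
static unsigned period_cycles(unsigned w1_nops, unsigned w2_nops,
                              unsigned w3_nops)
{
    return FIXED_TOTAL + w1_nops + w2_nops + w3_nops;
}

/* Convert a cycle count to nanoseconds for a given CPU clock in Hz. */
static double cycles_to_ns(unsigned cycles, double f_cpu_hz)
{
    return cycles * 1e9 / f_cpu_hz;
}
```

As an illustration only (these NOP counts are made up, not taken from your build): with w1_nops = 2, w2_nops = 5, w3_nops = 5 on a 16 MHz clock, a '0' bit is high for 4 cycles (250 ns), a '1' bit is high for 11 cycles (687.5 ns), and each bit takes 20 cycles (1250 ns). If NOPs were stripped from your copy, those are the numbers that shrink.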