**TL;DR version:** What is written below is not an answer to the OP's request for an explanation of the algorithm that the OP has found in a master's thesis from a Chinese university. It does, however, show that while the number of iterations is reduced by a factor of $2$, the execution time of each iteration is almost doubled and the number of finite-field multipliers is more than tripled.
I have not studied the algorithm given in the OP's question in detail, and so do not make any statement regarding whether it is correct. Nor can I provide an explanation of the algorithm. I do, however, wish to make the following observations.
- The (inversionless) Berlekamp-Massey (iBM) algorithm as implemented in VLSI by many people is an iterative algorithm with $2t$ iterations. Each iteration consists of two Steps: (i) compute the discrepancy (a sum of products), and (ii) use the discrepancy to update the polynomials. Since Step (ii) consumes the discrepancy produced by Step (i), the two Steps must be performed sequentially -- Step (i) and then Step (ii). With parallel computation in VLSI, Step (i) has execution time of one finite-field multiplication ($\operatorname{FFM}$) plus $\lceil\log_2(t+1)\rceil$ additions ($\operatorname{XOR}$ for fields of characteristic $2$), while Step (ii) has execution time $\operatorname{FFM} + \operatorname{XOR}$, for a total execution time of $2\operatorname{FFM} + (1+\lceil\log_2(t+1)\rceil)\operatorname{XOR}$ per iteration and $4t\operatorname{FFM} + 2t(1+\lceil\log_2(t+1)\rceil)\operatorname{XOR}$ for the entire KES process. The storage requirements are the syndrome register of $2t$ symbols, which remains unchanged during the entire KES process; $2$ registers of $t+1$ symbols to store the error-locator polynomial and its auxiliary polynomial (maximum degree $t$); and $2$ registers of $t$ symbols to store the error-evaluator polynomial and its auxiliary polynomial (maximum degree $t-1$).
- The key idea of the reformulated iBM algorithm devised by Naresh Shanbhag and myself in [1] uses the fact that the first of the $2t$ iterations of the Berlekamp-Massey algorithm does not require any discrepancy computation at all: the discrepancy is just the first entry of the syndrome register. We devised a scheme to modify the syndrome register so that its first entry would contain the discrepancy in succeeding iterations as well (the fixed location is very helpful in VLSI implementation), and then realized that doing so meant that
(a) Step 1(i) is not needed,
(b) Step 1(ii) and Step 2(i) can be executed simultaneously instead of sequentially,
(c) More generally, Step $m$(ii) and Step $(m+1)$(i) can be executed simultaneously instead of sequentially.
Thus, the KES process using the reformulated iBM algorithm has execution time of $2t(\operatorname{FFM} + \operatorname{XOR})$, which is less than half the execution time of the KES process using the iBM algorithm.
- The RiBM architecture implements the reformulated iBM algorithm in a single systolic array of $3t+1$ identical cells. Each cell has two finite-field multipliers, one finite-field adder, and one multiplexer. The array executes $2t$ iterations with total execution time $2t(\operatorname{FFM} + \operatorname{XOR})$. At the end of the KES process, the error-locator polynomial is found at the high end of the array and the error-evaluator polynomial at the low end of the array.
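To make the two-Step iteration concrete, here is a minimal software sketch of the iBM iteration described in the first bullet above. It is written in Python over the small prime field GF(7) purely for readability (a real decoder works over $GF(2^m)$ with dedicated multiplier circuits); the toy syndromes, error locators, and function name are my own illustrative choices, not anything from [1] or the thesis.

```python
# A minimal software sketch of the two-Step iBM iteration.  It works
# over the small prime field GF(7) instead of GF(2^m); the control flow
# and the data dependency between the Steps are the point here, not the
# field arithmetic.  Toy data below is my own illustrative choice.

P = 7  # field modulus, a stand-in for GF(2^m) hardware arithmetic

def ibm(S, t):
    """Inversionless Berlekamp-Massey; returns a scalar multiple of the
    error-locator polynomial (low-order coefficient first)."""
    lam = [1] + [0] * t          # lambda(x), maximum degree t
    b   = [1] + [0] * t          # auxiliary polynomial
    gamma, k = 1, 0
    for r in range(2 * t):
        # Step (i): the discrepancy, a sum of products.  In VLSI this is
        # one FFM followed by a ceil(log2(t+1))-deep adder (XOR) tree.
        delta = sum(lam[i] * S[r - i] for i in range(min(r, t) + 1)) % P
        # Step (ii): polynomial update, one FFM + one XOR deep.  Note
        # that it cannot start until Step (i) has delivered delta.
        lam, prev = [(gamma * lam[i] - delta * (b[i - 1] if i else 0)) % P
                     for i in range(t + 1)], lam
        if delta != 0 and k >= 0:
            b, gamma, k = prev, delta, -k - 1
        else:
            b, k = [0] + b[:-1], k + 1   # b(x) <- x * b(x)
    return lam

# Two errors with locators 2 and 3 (unit error values) in GF(7):
# syndromes S_j = 2^j + 3^j for j = 1..4, so 2t = 4 and t = 2.
lam = ibm([5, 6, 0, 6], t=2)
# lam is proportional to (1 - 2x)(1 - 3x), so it vanishes at the
# inverse locators 2^{-1} = 4 and 3^{-1} = 5 (mod 7).
```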
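And here is a corresponding sketch of the reformulated iteration: a modified register `dhat`, initialized with the syndromes, is updated in lockstep with $\lambda(x)$ so that `dhat[0]` is the discrepancy at every iteration, and Step (i)'s sum of products disappears. Again, this is only a software illustration over GF(7) with my own toy data, not the VLSI description given in [1].

```python
# Sketch of the "discrepancy at a fixed location" reformulation.
# dhat[i] tracks the coefficient of x^(r+i) in lambda(x)S(x), and
# theta[i] the coefficient of x^(r+i) in b(x)S(x), so dhat[0] is the
# current discrepancy and no sum-of-products Step is needed.

P = 7  # small prime field as a stand-in for GF(2^m)

def reformulated_ibm(S, t):
    """Key-equation solver with the discrepancy at a fixed location."""
    lam   = [1] + [0] * t
    b     = [1] + [0] * t
    dhat  = list(S)            # modified syndrome register, length 2t
    theta = list(S)            # its auxiliary register
    gamma, k = 1, 0
    for r in range(2 * t):
        delta = dhat[0]        # discrepancy: simply read, never summed
        # lambda(x) and dhat are updated in lockstep (one FFM + one XOR
        # deep each, and they can proceed simultaneously).
        new_lam = [(gamma * lam[i] - delta * (b[i - 1] if i else 0)) % P
                   for i in range(t + 1)]
        new_dhat = [(gamma * (dhat[i + 1] if i + 1 < 2 * t else 0)
                     - delta * theta[i]) % P for i in range(2 * t)]
        if delta != 0 and k >= 0:
            b, theta = lam, dhat[1:] + [0]
            gamma, k = delta, -k - 1
        else:
            b, k = [0] + b[:-1], k + 1
        lam, dhat = new_lam, new_dhat
    return lam

# Same toy example as before: errors with locators 2 and 3 in GF(7),
# syndromes S_j = 2^j + 3^j for j = 1..4.
lam = reformulated_ibm([5, 6, 0, 6], t=2)
# lam again vanishes at the inverse locators 4 and 5 (mod 7).
```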
With all that as preliminaries, let us examine the algorithm stated in the OP's question and consider its implementation in an array of $3t+1$ cells that iterates only $t$ times instead of the $2t$ iterations required by the RiBM architecture. The computation in each iteration seems to have an execution time of $3\operatorname{FFM} + 2\operatorname{XOR}$ for Step 1, plus an additional multiplication $\big(\theta_i(r+1)=\gamma(r)\cdot \delta_i(r+2)\big)$ in Step 2, which gives a total execution time of $t(4\operatorname{FFM} + 2\operatorname{XOR})$: about twice the $t(2\operatorname{FFM} + 2\operatorname{XOR})$ achieved by the RiBM architecture in [1]. Each cell has $7$ finite-field multipliers, $3$ finite-field adders, and many multiplexers (I didn't bother to count how many). There are also issues with regard to wiring delays. In the RiBM architecture, all connections between the cells of the array are local: each cell #$i$ connects only to the neighboring cells #$(i+1)$ and #$(i-1)$, whereas in the proposed architecture, cell #$i$ also connects to cell #$(i+2)$.
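To put concrete numbers on that comparison (the code parameters here are my own illustrative choice, not from the thesis): for $t=8$, as in the popular $(255,239)$ Reed-Solomon code, the RiBM architecture finishes the KES process in $16$ iterations at a cost of $16\operatorname{FFM}+16\operatorname{XOR}$, while the proposed array finishes in $8$ iterations but at a cost of $8(4\operatorname{FFM}+2\operatorname{XOR}) = 32\operatorname{FFM}+16\operatorname{XOR}$. Halving the iteration count roughly doubles the total multiplier delay while leaving the adder delay unchanged.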
For these reasons, I think that the master's thesis found by the OP is not as significant an advance in the state of the art as the OP believes it to be.
[1] Dilip V. Sarwate and Naresh R. Shanbhag, "High-Speed Architectures for Reed-Solomon Decoders," IEEE Trans. VLSI Systems, vol. 9, pp. 641-655, October 2001.