Here is how I would approach your problem:
Preparation
Split the time series into bursts. In particular, do not use moving windows, since they capture neither the transition between zero and a burst nor longer bursts well.
Collect a test dataset $\mathcal{B} = \{B_1,\ldots,B_n\}$ of clear real bursts and a test dataset $\mathcal{C} = \{C_1,\ldots,C_m\}$ of noise bursts. The more, the better.
General procedure
You want to find some characteristic derived from the test datasets that tells you whether a given burst is noise or not (forgive me if I am stating the obvious here). As the noise bursts are more uniform, what suggests itself is using some similarity to the noise bursts $\mathcal{C}$ as a measure. Ideally, your measure indicates higher similarity for any of $C_1,\ldots,C_m$ than for any of $B_1,\ldots,B_n$. However, be prepared for not obtaining such a perfect separation. You can use receiver operating characteristics (ROC) to quantify the quality of your separation and to find a threshold that suits your needs (i.e., is a good compromise between false positives and false negatives).
Be aware that fine-tuning your characteristic may make it overly specific to your test dataset (in-sample optimisation). You can avoid this by collecting two pairs of test datasets, tuning on one pair, and assessing your method's separation capabilities on the respective other one.
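To make the ROC step concrete, here is a minimal sketch using only NumPy. It assumes you already have one score per burst where a higher score means "more similar to noise" (if your measure is a distance, negate it first), and it treats "noise" as the positive class. The scores below are synthetic placeholders, and picking the threshold that maximises the true positive rate minus the false positive rate is just one possible compromise, not the only sensible one.

```python
import numpy as np

def roc_curve_points(scores_noise, scores_real):
    """ROC points for the rule 'call a burst noise if score >= threshold'."""
    scores_noise = np.asarray(scores_noise, dtype=float)
    scores_real = np.asarray(scores_real, dtype=float)
    # Every observed score is a candidate threshold, from strictest to laxest.
    thresholds = np.unique(np.concatenate([scores_noise, scores_real]))[::-1]
    tpr = np.array([(scores_noise >= t).mean() for t in thresholds])  # noise caught
    fpr = np.array([(scores_real >= t).mean() for t in thresholds])   # real bursts misjudged
    return thresholds, fpr, tpr

def roc_auc(scores_noise, scores_real):
    """Area under the ROC curve: P(noise score > real score), ties counted half."""
    sn = np.asarray(scores_noise, dtype=float)[:, None]
    sr = np.asarray(scores_real, dtype=float)[None, :]
    return float((sn > sr).mean() + 0.5 * (sn == sr).mean())

# Synthetic placeholder scores (assumption): replace with your own measure.
rng = np.random.default_rng(1)
scores_noise = rng.normal(1.0, 0.5, size=50)  # scores of the bursts in C
scores_real = rng.normal(0.0, 0.5, size=80)   # scores of the bursts in B

thresholds, fpr, tpr = roc_curve_points(scores_noise, scores_real)
area = roc_auc(scores_noise, scores_real)
best = thresholds[np.argmax(tpr - fpr)]  # one possible compromise
print(f"AUC = {area:.3f}, chosen threshold = {best:.3f}")
```

An area of 1 corresponds to the perfect separation described above, while 0.5 means the measure is no better than guessing.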
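A rough sketch of this out-of-sample check, again with synthetic placeholder scores (higher meaning "more noise-like") and with balanced accuracy as an assumed quality criterion: the threshold is tuned on the first pair of datasets and then judged on the second pair only. The function names are made up for illustration.

```python
import numpy as np

def balanced_accuracy(threshold, scores_noise, scores_real):
    """Mean of the fraction of noise bursts flagged and real bursts kept."""
    return 0.5 * ((scores_noise >= threshold).mean() + (scores_real < threshold).mean())

def tune_threshold(scores_noise, scores_real):
    """Pick the observed score that maximises balanced accuracy on this pair."""
    candidates = np.unique(np.concatenate([scores_noise, scores_real]))
    return max(candidates, key=lambda t: balanced_accuracy(t, scores_noise, scores_real))

# Two independent pairs of (noise scores, real scores); synthetic placeholders.
rng = np.random.default_rng(2)
pair_one = (rng.normal(1.0, 0.5, size=40), rng.normal(0.0, 0.5, size=40))
pair_two = (rng.normal(1.0, 0.5, size=40), rng.normal(0.0, 0.5, size=40))

threshold = tune_threshold(*pair_one)
in_sample = balanced_accuracy(threshold, *pair_one)
out_of_sample = balanced_accuracy(threshold, *pair_two)
print(f"in-sample: {in_sample:.3f}, out-of-sample: {out_of_sample:.3f}")
```

If the out-of-sample value is much worse than the in-sample one, the characteristic has probably been tuned too closely to the first pair. Swapping the roles of the two pairs gives a second, independent check.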
Possible Characteristics
The easiest approach is to accumulate all noise bursts ($\mathcal{C}$) into one distribution and use something like the Kolmogorov–Smirnov test statistic or the Mann–Whitney test statistic to quantify similarity between a burst’s distribution of values and the distribution of values from the known noise bursts $\mathcal{C}$. This makes most sense if the values of all noise bursts are roughly sampled from the same distribution, but this is not a strict requirement; it may still allow for a good separation.
An example of a more complicated characteristic would be to count the number of members of $\mathcal{C}$ your burst complies with.
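As a minimal sketch of the pooled approach, the following uses `scipy.stats.ks_2samp` to compare a burst's values with the values of all known noise bursts taken together. The function name `noise_distance` and the Gaussian test data are assumptions for illustration only. Note that the Kolmogorov–Smirnov statistic is a distance: smaller values mean the burst looks more like noise.

```python
import numpy as np
from scipy.stats import ks_2samp

def noise_distance(burst, noise_bursts):
    """Two-sample KS statistic between a burst and the pooled noise values.

    Ranges from 0 (identical empirical distributions) to 1 (no overlap).
    """
    pooled = np.concatenate([np.asarray(c, dtype=float) for c in noise_bursts])
    return ks_2samp(np.asarray(burst, dtype=float), pooled).statistic

# Synthetic placeholder data (assumption): noise values around 0,
# a "real" burst with clearly shifted values.
rng = np.random.default_rng(0)
noise_bursts = [rng.normal(0.0, 1.0, size=200) for _ in range(20)]
candidate_noise = rng.normal(0.0, 1.0, size=200)
candidate_real = rng.normal(3.0, 1.0, size=200)

d_noise = noise_distance(candidate_noise, noise_bursts)
d_real = noise_distance(candidate_real, noise_bursts)
print(f"noise-like burst: {d_noise:.3f}, real-looking burst: {d_real:.3f}")
```

`scipy.stats.mannwhitneyu` could be swapped in the same way, though it mainly responds to shifts in location rather than to general differences in shape.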
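What "complies with" means is up to you. One possible reading, which is purely an assumption here, is that a burst complies with a noise burst if a two-sample Kolmogorov–Smirnov test does not reject the hypothesis that both are drawn from the same distribution. The sketch below counts such noise bursts; the function name, the significance level `alpha` and the test data are all illustrative choices.

```python
import numpy as np
from scipy.stats import ks_2samp

def count_compatible(burst, noise_bursts, alpha=0.05):
    """Number of noise bursts the KS test cannot distinguish from `burst`."""
    burst = np.asarray(burst, dtype=float)
    return int(sum(
        ks_2samp(burst, np.asarray(c, dtype=float)).pvalue > alpha
        for c in noise_bursts
    ))

# Synthetic placeholder data (assumption).
rng = np.random.default_rng(3)
noise_bursts = [rng.normal(0.0, 1.0, size=100) for _ in range(30)]
count_noise = count_compatible(rng.normal(0.0, 1.0, size=100), noise_bursts)
count_real = count_compatible(rng.normal(3.0, 1.0, size=100), noise_bursts)
print(f"noise-like burst: {count_noise} of 30, real-looking burst: {count_real} of 30")
```

Unlike the pooled approach, this does not require all noise bursts to share one distribution, at the price of many small-sample tests per burst.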