NEW METHOD:
This is much quicker. The final result, a list of the partitioned audio segments, is stored in saudio.
(By the way, the dictionary is really reliable - it translates огонь as wire instead of fire :-D)
    (* rebuild the MP3 bytes that are stored as pixel values in the PNG images *)
    aud = ImportByteArray[
       ByteArray[
        Join[Flatten[
          Round[(ImageData[Import[#]][[All, All, 1]])*255] & /@
           {"https://i.sstatic.net/jtqLu.png",
            "https://i.sstatic.net/IW5q6.png",
            "https://i.sstatic.net/bVYL0.png",
            "https://i.sstatic.net/CNAUu.png"}]]], "MP3"];
    del = ImportByteArray[
       ByteArray[
        Join[Flatten[
           Round[(ImageData[Import[#]][[All, All, 1]])*255] & /@
            {"https://i.sstatic.net/F1j5L.png"}]][[1 ;; -4]]], "MP3"];

    (* RMS amplitude of aud in windows as long as the delimiter *)
    le = QuantityMagnitude@AudioLength[del];
    alm = AudioLocalMeasurements[aud, "RMSAmplitude",
       PartitionGranularity -> Quantity[le, "Samples"]];

    (* keep peaks whose RMS lies between 0.005 and 0.02, cut 0.2 s on either side of each *)
    par = Partition[
       Join[{0},
        Flatten[{# - 0.2, # + 0.2} & /@
          Select[FindPeaks[alm, 0] // Normal, 0.005 < #[[2]] < 0.02 &][[All, 1]]],
        {QuantityMagnitude[Duration[aud], "Seconds"]}], 2];
    saudio = AudioTrim[aud, #] & /@ par;
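To make the last two steps concrete, here is a minimal sketch of how peak times turn into trim intervals. The peak times and the total duration are made-up values, not measured from test.mp3; only the Partition/Join/Flatten pattern and the 0.2 s margin are taken from the code above.

```mathematica
(* hypothetical peak times (seconds) and total duration, for illustration only *)
peaks = {1.0, 3.5};
total = 5.0;

(* cut 0.2 s on either side of each peak, then pair up the remaining boundaries *)
par = Partition[Join[{0}, Flatten[{# - 0.2, # + 0.2} & /@ peaks], {total}], 2]
(* three intervals: roughly 0 to 0.8, 1.2 to 3.3 and 3.7 to 5 *)
```

Each pair is a {start, end} interval in seconds that can be passed to AudioTrim, so n peaks give n + 1 segments.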
OLD METHOD:
All the computation (correlation) is done by:
    vys = {};
    Do[
     dif = (daud[[1, n + 1 ;; n + le2]] - ddel[[1]]) // Abs // Total;
     If[dif < 380, AppendTo[vys, {n, dif}]],
     {n, 1, le1 - le2}]
    vys
where daud is the audio data of test.mp3 and ddel is the audio data of delimiter.mp3. The variable vys collects the positions of all delimiter sounds (together with their difference values); the positions are shown as the list output further below.
Since Mathematica's ListCorrelate performs badly when it is given custom functions (its fifth and sixth arguments), I had to fall back on a good old Do loop, which outperforms ListCorrelate by several orders of magnitude. Even so, the Do loop took 12 minutes.
Yes, it is ridiculous that the whole recording plays in 3:47 while the computation takes more than three times as long.
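For readers who want to see what the loop computes, here is a toy version of the sliding comparison on short integer lists. The lists signal and pattern are made-up stand-ins for daud[[1]] and ddel[[1]], and the threshold 1 is arbitrary. The sketch also shows the ListCorrelate form with custom functions mentioned above, which should give the same numbers; it is an illustration only, not the code that was timed.

```mathematica
(* made-up data: the pattern {1, 2, 1} occurs twice in the signal *)
signal = {0, 0, 1, 2, 1, 0, 0, 0, 1, 2, 1, 0};
pattern = {1, 2, 1};
len = Length[pattern];

(* sum of absolute differences for every window (n starts at 0 here, the loop above starts at 1) *)
dif = Table[
   Total[Abs[signal[[n + 1 ;; n + len]] - pattern]],
   {n, 0, Length[signal] - len}];

(* the same quantity through ListCorrelate with custom functions as 5th and 6th arguments *)
difLC = ListCorrelate[pattern, signal, {1, -1}, {}, Abs[#1 - #2] &, Plus];

(* offsets n of the windows that fall below the threshold *)
hits = Flatten[Position[dif, _?(# < 1 &)]] - 1
(* {2, 8} *)
```

On data this small both forms finish instantly, so the sketch says nothing about speed; it only shows that the two formulations describe the same quantity.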
    (*aud=Import["C:\\...\\test.mp3"];
    del=Import["C:\\...\\delimiter.mp3"];*)
    aud = ImportByteArray[
       ByteArray[
        Join[Flatten[
          Round[(ImageData[Import[#]][[All, All, 1]])*255] & /@
           {"https://i.sstatic.net/jtqLu.png",
            "https://i.sstatic.net/IW5q6.png",
            "https://i.sstatic.net/bVYL0.png",
            "https://i.sstatic.net/CNAUu.png"}]]], "MP3"];
    del = ImportByteArray[
       ByteArray[
        Join[Flatten[
           Round[(ImageData[Import[#]][[All, All, 1]])*255] & /@
            {"https://i.sstatic.net/F1j5L.png"}]][[1 ;; -4]]], "MP3"];
    daud = AudioData[aud];
    ddel = AudioData[del];
    ddel = ddel/Max[ddel[[1]]];
    le1 = daud[[1]] // Length;
    le2 = ddel[[1]] // Length;
    vys = {};
    Do[
     dif = (daud[[1, n + 1 ;; n + le2]] - ddel[[1]]) // Abs // Total;
     If[dif < 380, AppendTo[vys, {n, dif}]],
     {n, 1, le1 - le2}]
    vys
    pos = (SortBy[#, #[[2]] &] & /@
         Gather[vys, Abs[First@#1 - First@#2] < le2/2 &])[[All, 1, 1]];
    Partition[Join[{1}, Flatten[{#, # + le2} & /@ pos], {le1}], 2];
    AudioTrim[
       aud, {Quantity[#[[1]], "Samples"], Quantity[#[[2]], "Samples"]}] & /@
     Partition[Join[{1}, Flatten[{#, # + le2} & /@ pos], {le1}], 2]
{121101,350616,580313,809835,1039339,1268895,1498431,1727917,1957611,2187133,2415123,2649062,2878614,3108288,3337810,3567318,3796869,4026366,4255889,4485586,4715110,4944615,5174180,5403663,5633189,5862885,6092408,6321905,6551419,6780961,7010485,7240181,7469704,7699205,7928716,8158296,8387783,8617480,8847001,9076505,9306015,9535568,9765081,9994777}
The code produces a list with the positions of the delimiters in sample units, and a list of the partitions of the original audio with the delimiter sound removed.
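How the raw hits are reduced to one position per delimiter may be easier to follow on a small made-up example. The values of hits, winLen and nSamples below are invented (they stand in for vys, le2 and le1); the Gather, SortBy and Partition pattern is the one used in the code above.

```mathematica
(* invented {offset, difference} pairs: two clusters of neighbouring hits *)
hits = {{100, 5}, {101, 2}, {102, 7}, {500, 3}, {501, 1}};
winLen = 50;      (* stand-in for le2 *)
nSamples = 1000;  (* stand-in for le1 *)

(* hits closer than half a window are grouped together; keep the one with the smallest difference *)
pos = (SortBy[#, #[[2]] &] & /@
     Gather[hits, Abs[First@#1 - First@#2] < winLen/2 &])[[All, 1, 1]]
(* {101, 501} *)

(* sample ranges between the delimiters *)
segs = Partition[Join[{1}, Flatten[{#, # + winLen} & /@ pos], {nSamples}], 2]
(* {{1, 101}, {151, 501}, {551, 1000}} *)
```

Each range in segs would then go to AudioTrim as a pair of Quantity[..., "Samples"] values, as in the code above.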
