
I have ~20,000 .asm files that were output by IDA Pro (Hex-Rays).

They were all created from known malware, and all from 32-bit Windows Portable Executables.

I do not have the original executables, just the disassembled output (.asm) files.

  • What I am trying to obtain is a list of every mnemonic (e.g. add, xor, jmp, etc.) that IDA could output into an .asm file.

    With this list I will attempt a machine learning / malware classification task, using grep (or similar) to compile statistics.

By inspecting the files visually I have hand-crafted a list of 30 or so mnemonics (jmp, push, mov, call, lea, etc.) with help from this site, which lists common instructions: http://www.strchr.com/x86_machine_code_statistics.

Are there any clues in the headers of these files that could help define the possible mnemonics? Are they consistent across platforms, or specific to some attribute of the original file?

I searched IDA Pro's documentation, and it seems all the functionality for this is available during the disassembly process, but I am stuck with only the .asm files to parse.

Similar questions that did not help:

Parsing IDA Pro .asm files

IDA Pro List of Functions with Instruction

Sample .asm header:

    ;
    ; +-------------------------------------------------------------------------+
    ; |   This file has been generated by The Interactive Disassembler (IDA)    |
    ; |   Copyright (c) 2013 Hex-Rays, <[email protected]>                       |
    ; |   License info:                                                         |
    ; |   Microsoft                                                             |
    ; +-------------------------------------------------------------------------+
    ;
    ; ---------------------------------------------------------------------------
    ; Format      : Portable executable for 80386 (PE)
    ; Imagebase   : 400000
    ; Section 1. (virtual address 00001000)
    ; Virtual size                  : 0002964D ( 169549.)
    ; Section size in file          : 00029800 ( 169984.)
    ; Offset to raw data for section: 00000400
    ; Flags 60000020: Text Executable Readable
    ; Alignment     : default
    ; OS type         :  MS Windows
    ; Application type:  Executable 32bit

                    include uni.inc ; see unicode subdir of ida for info on unicode

                    .686p
                    .mmx
                    .model flat

    ; ===========================================================================

Sample from inside a file:

    .text:00401080 ; ---------------------------------------------------------------------------
    .text:00401081 CC CC CC CC CC CC CC CC CC CC CC CC CC CC CC    align 10h
    .text:00401090 8B 44 24 10                                     mov     eax, [esp+10h]
    .text:00401094 8B 4C 24 0C                                     mov     ecx, [esp+0Ch]
    .text:00401098 8B 54 24 08                                     mov     edx, [esp+8]
    .text:0040109C 56                                              push    esi
    .text:0040109D 8B 74 24 08                                     mov     esi, [esp+8]
    .text:004010A1 50                                              push    eax
    .text:004010A2 51                                              push    ecx
    .text:004010A3 52                                              push    edx
    .text:004010A4 56                                              push    esi
    .text:004010A5 E8 18 1E 00 00                                  call    _memcpy_s
    .text:004010AA 83 C4 10                                        add     esp, 10h
    .text:004010AD 8B C6                                           mov     eax, esi
    .text:004010AF 5E                                              pop     esi
    .text:004010B0 C3                                              retn
    .text:004010B0 ; ---------------------------------------------------------------------------

Thanks for any pointers or clues as to the best way to approach this and my apologies if this isn't suitable for this forum.

  • What you want is an x86 opcode reference. Commented Mar 6, 2015 at 18:11
  • Are you hoping to find that there is a specific set of assembler opcodes used by malware? I'd doubt that assertion right from the start, on the premise that malware is just regular software (which happens to do malicious things). Commented Mar 6, 2015 at 22:10
  • @Jongware This is part of a data mining competition to classify 'families' of malware: kaggle.com/c/malware-classification Commented Mar 7, 2015 at 22:20

1 Answer

As I'm working with the malware samples provided by Kaggle too, I faced the same problem. I solved it by processing the files in two steps, which together extract all the mnemonics used in the complete set.

Note: As I'm not finished with my work yet, I'm not able to post the full script. The real implementation uses threading, and the process takes roughly one hour for all 9 families. Additionally, the solution is neither perfect nor fast; it is rather a dirty fix.


1. Step: Roughly clean the IDA listing format of an INPUT.ASM into an OUTPUT.ASM (extract from my script; see the discussion for this step here)

Note: This step ignores dd-like data directives (dq, dd, db, dw). Additionally, I keep the subroutines and basic blocks delimited by ==== and -----.

    grep -E '^\.text:' INPUT.ASM |
      grep -v align |
      grep -E '^.{10,15}[0-9A-F]{2} *|=======================|-----------------------------------' |
      sed 's/\t/ /g' |
      grep -v ' dq ' | grep -v ' dd ' | grep -v ' db ' | grep -v ' dw ' |
      cut -c100-200 |
      sed -e 's/^[ \t]*//' |
      tr -s '[:blank:]' |
      cut -d ';' -f1 > OUTPUT.ASM
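The same filtering can be sketched in Python without depending on fixed column offsets such as `cut -c100-200`. This is only a rough illustration under stated assumptions: every instruction line is assumed to look like the sample in the question (`.text:` plus an address, then uppercase hex byte pairs, then the instruction text), and the names `LINE_RE`, `DATA_DIRECTIVES` and `extract_mnemonic` are hypothetical, not part of the script above.

```python
import re

# Assumed line layout: ".text:ADDRESS", one or more uppercase hex byte
# pairs, then the instruction text. Lines that do not fit are ignored.
LINE_RE = re.compile(
    r"^\.text:[0-9A-Fa-f]+\s+"   # segment name and address
    r"(?:[0-9A-F]{2}\s+)+"       # opcode bytes, e.g. "8B 44 24 10"
    r"([a-z][a-z0-9]*)\b"        # first lowercase token: the mnemonic
)

# Assembler directives that describe data or padding, not instructions.
DATA_DIRECTIVES = {"align", "db", "dw", "dd", "dq", "dt"}


def extract_mnemonic(line):
    """Return the mnemonic of one listing line, or None if there is none."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # comment, delimiter, label or other non-code line
    token = match.group(1)
    return None if token in DATA_DIRECTIVES else token
```

Calling `extract_mnemonic` on each line of an INPUT.ASM and discarding the `None` results gives one mnemonic per instruction line. Lines whose byte column is laid out differently from the sample would be skipped by this pattern, so the result should be spot-checked against a few files.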

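2. Step: Process the cleaned OUTPUT.ASM in Python (extract from my script)

    #!/usr/bin/python
    # Collect every distinct mnemonic-like first token of the cleaned listing.
    mneLocal = set()
    with open('OUTPUT.ASM') as oFile:
        for line in oFile:
            mne = line.split(" ")[0]
            # Skip delimiter lines (---- / ====), tokens starting with a digit,
            # tokens longer than 6 characters and anything not all lowercase.
            # A token from a line without operands keeps its line ending.
            if mne[0] != '-' and mne[0] != '=' and len(mne) <= 6 and not mne[0].isdigit() and mne.islower():
                mneLocal.add(mne)
    print(mneLocal)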
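Since the question asks for statistics rather than just the set of names, the same filter can feed a frequency count. The following is a minimal sketch, assuming the input lines are already cleaned as in step 1; `count_mnemonics` is a hypothetical helper name, and `collections.Counter` is swapped in for the plain `set` used above.

```python
from collections import Counter


def count_mnemonics(lines):
    """Count how often each mnemonic-like first token occurs."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # splits on any whitespace, drops line endings
        if not tokens:
            continue  # blank line
        token = tokens[0]
        # Same filter as in step 2: no delimiters, no leading digit,
        # at most 6 characters, all lowercase.
        if token[0] in "-=" or token[0].isdigit():
            continue
        if len(token) > 6 or not token.islower():
            continue
        counts[token] += 1
    return counts
```

With a file object this could be used as `count_mnemonics(open('OUTPUT.ASM'))`, and `counts.most_common(20)` would then list the twenty most frequent tokens.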

3. Output: Applied to the Ramnit dataset

 set(['jns', 'fbstp', 'jnp', 'rol', 'psrlw', 'fld1', 'jnz', 'movd', 'imul', 'lds', 'jnb', 'psrlq', 'cdq', 'psrld', 'pand', 'pfmax', 'ror', 'fxch', 'jno', 'dt', 'fisub', 'movq', 'cmps', 'arpl', 'pi2fd', 'pfmin', 'cld', 'nop', 'pf2id', 'maxss', 'add', 'jcxz', 'adc', 'fadd', 'pf2iw', 'fistp', 'setbe', 'aad', 'maxps', 'fmulp', 'movzx', 'fdivp', 'fdivr', 'femms', 'not', 'repe', 'cmc\r\n', 'svts', 'repne', 'shr', 'pfadd', 'sgdt', 'mulps', 'leave', 'div', 'mulpd', 'shl', 'btc', 'cmp', 'rcpps', 'psubd', 'psubb', 'bts', 'btr', 'loope', 'jle', 'pandn', 'fist', 'out', 'fstcw', 'cbw\r\n', 'xor', 'sub', 'neg', 'rep', 'lddqu', 'jge', 'movs', 'pfrcp', 'fdiv', 'jecxz', 'xchg', 'mul', 'pavgb', 'lea', 'ficom', 'pfsub', 'jz', 'addpd', 'jp', 'subsd', 'js', 'bt', 'fidiv', 'daa\r\n', 'jo', 'clc\r\n', 'lods', 'jg', 'ja', 'jb', 'addps', 'jl', 'cmovz', 'movsd', 'cld\r\n', 'xorpd', 'les', 'cmovl', 'subss', 'movsx', 'xlat', 'cmova', 'cmovb', 'nop\r\n', 'sbb', 'or', 'cmovg', 'shrd', 'fsub', 'por', 'bound', 'pop', 'setnb', 'fmul', 'pabsw', 'subps', 'minsd', 'minss', 'sti\r\n', 'xadd', 'cdq\r\n', 'setnl', 'retf', 'faddp', 'retn', 'rcr', 'rcl', 'pslld', 'call', 'setnz', 'das\r\n', 'aas\r\n', 'setns', 'setnp', 'sldt', 'ptest', 'fcomi', 'divps', 'jmp', 'rcpss', 'ffree', 'lgdt', 'pfacc', 'utes', 'shld', 'fcomp', 'fsave', 'psraw', 'aam', 'subpd', 'fstsw', 'psrad', 'pxor', 'fsubp', 'fsubr', 'fldcw', 'dec', 'fld', 'loop', 'and', 'addsd', 'cmovs', 'fldz', 'psubq', 'sal', 'int', 'lock', 'andpd', 'in', 'fucom', 'ud2\r\n', 'addss', 'fild', 'sar', 'scas', 'psllw', 'andps', 'bswap', 'inc', 'mulss', 'paddd', 'std\r\n', 'paddb', 'psubw', 'stc\r\n', 'idiv', 'psllq', 'paddw', 'cli\r\n', 'mulsd', 'paddq', 'test', 'setp', 'fiadd', 'hnt', 'orpd', 'enter', 'minps', 'bsr', 'mov', 'orps', 'fstp', 'xorps', 'setle', 'bsf', 'fo', 'pfmul', 'movss', 'setb', 'aaa\r\n', 'setl', 'divsd', 'fimul', 'seto', 'fcom', 'hlt\r\n', 'jbe', 'fst', 'divss', 'sets', 'push', 'pavgw', 'setz']) 
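Several entries in this set differ only by a trailing line ending, for example 'cld' and 'cld\r\n'. A small sketch of how such a result could be normalized afterwards is shown here; `normalize_mnemonics` is a hypothetical name, and the function only strips whitespace, so other stray tokens would still need manual review against an x86 opcode reference.

```python
def normalize_mnemonics(raw_tokens):
    """Strip surrounding whitespace and merge the resulting duplicates."""
    cleaned = {token.strip() for token in raw_tokens}
    cleaned.discard("")  # drop tokens that were whitespace only
    return sorted(cleaned)
```

For instance, passing `['cld', 'cld\r\n', 'nop\r\n', 'nop']` leaves just the two names `cld` and `nop`.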
