I have ~20,000 .asm files produced by IDA Pro (Hex-Rays).
These were all created from known malware, and all from 32-bit Windows Portable Executables.
I do not have the original executables, just the disassembled output (.asm) files.
What I am trying to obtain is a list of every mnemonic (e.g. add, xor, jmp, etc.) that IDA could output into an .asm file.
With this list I will attempt a machine learning / malware classification task, using grep (or similar) to compile statistics.
Inspecting the files visually, I have hand-crafted a list of 30 or so (jmp, push, mov, call, lea, etc.) with help from this site, which lists common instructions: http://www.strchr.com/x86_machine_code_statistics.
Are there any clues in the headers of these files that could help define the possible mnemonics? Are the mnemonics consistent across platforms, or specific to some attribute of the original file?
I searched IDA Pro's documentation, and it seems all the functionality for this is available during the disassembly process, but I am stuck with only the .asm files to parse.
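To make the parsing problem concrete, here is a minimal sketch of the kind of counting I have in mind. It assumes every instruction line looks like the ones in my sample listing (segment:address, uppercase hex opcode bytes, then a lowercase mnemonic). The names `LINE_RE`, `DIRECTIVES` and `count_mnemonics` are my own, and the directive list is a guess rather than anything taken from IDA's documentation.

```python
import re
from collections import Counter

# Assumed line layout, based only on the sample listing in this post:
#   .text:00401090 8B 44 24 10    mov eax, [esp+10h]
LINE_RE = re.compile(
    r"^[^:\s]+:[0-9A-F]{8}\s+"      # segment name and 8-digit address
    r"(?:[0-9A-F]{2}\+?\s+)+"       # opcode bytes (a trailing '+' is assumed to mark truncation)
    r"([a-z][a-z0-9]*)\b"           # first lowercase token = candidate mnemonic
)

# Hypothetical, incomplete list of assembler directives that are not instructions.
DIRECTIVES = {"align", "db", "dw", "dd", "dq", "dt", "unicode"}


def count_mnemonics(lines):
    """Count candidate mnemonics over an iterable of listing lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(1) not in DIRECTIVES:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ".text:00401081 CC CC CC CC align 10h",
        ".text:00401090 8B 44 24 10 mov eax, [esp+10h]",
        ".text:0040109C 56 push esi",
        ".text:004010A5 E8 18 1E 00 00 call _memcpy_s",
        ".text:004010B0 C3 retn",
    ]
    for mnemonic, n in sorted(count_mnemonics(sample).items()):
        print(mnemonic, n)
```

Running this over all the files would give me the set of mnemonics that actually occur, but not the set IDA could emit, which is what I am really asking about. Note that a prefixed instruction such as `rep movsd` would be counted under `rep` with this pattern.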
A similar question that did not help:
IDA Pro List of Functions with Instruction
Sample .asm header:
    ;
    ; +-------------------------------------------------------------------------+
    ; |   This file has been generated by The Interactive Disassembler (IDA)    |
    ; |           Copyright (c) 2013 Hex-Rays, <[email protected]>               |
    ; |                      License info:                                      |
    ; |                            Microsoft                                    |
    ; +-------------------------------------------------------------------------+
    ;
    ; ---------------------------------------------------------------------------
    ; Format : Portable executable for 80386 (PE)
    ; Imagebase : 400000
    ; Section 1. (virtual address 00001000)
    ; Virtual size : 0002964D ( 169549.)
    ; Section size in file : 00029800 ( 169984.)
    ; Offset to raw data for section: 00000400
    ; Flags 60000020: Text Executable Readable
    ; Alignment : default
    ; OS type : MS Windows
    ; Application type: Executable 32bit

    include uni.inc ; see unicode subdir of ida for info on unicode

    .686p
    .mmx
    .model flat

    ; ===========================================================================

Sample from inside:
    .text:00401080 ; ---------------------------------------------------------------------------
    .text:00401081 CC CC CC CC CC CC CC CC CC CC CC CC CC CC CC    align 10h
    .text:00401090 8B 44 24 10       mov     eax, [esp+10h]
    .text:00401094 8B 4C 24 0C       mov     ecx, [esp+0Ch]
    .text:00401098 8B 54 24 08       mov     edx, [esp+8]
    .text:0040109C 56                push    esi
    .text:0040109D 8B 74 24 08       mov     esi, [esp+8]
    .text:004010A1 50                push    eax
    .text:004010A2 51                push    ecx
    .text:004010A3 52                push    edx
    .text:004010A4 56                push    esi
    .text:004010A5 E8 18 1E 00 00    call    _memcpy_s
    .text:004010AA 83 C4 10          add     esp, 10h
    .text:004010AD 8B C6             mov     eax, esi
    .text:004010AF 5E                pop     esi
    .text:004010B0 C3                retn
    .text:004010B0 ; ---------------------------------------------------------------------------

Thanks for any pointers or clues as to the best way to approach this, and my apologies if this isn't suitable for this forum.