Malware analysts spend appreciable time inspecting the situations beneath which malicious packages will take sure actions. For instance, take into account a malware program that incorporates a examine for the presence of a debugger, a standard method meant to hinder evaluation. The analyst could want to know if there’s a viable execution path that circumvents this examine and, if there’s a path, what inputs and environmental situations are wanted to traverse it. We name one of these reasoning path discovering. The paths to search out will be based mostly on quite a few standards, corresponding to user-specified begin and finish addresses or passing by means of particular program factors. CERT’s reverse engineers and malware analysts have discovered path discovering is helpful when analyzing malware.
In a earlier submit, we mentioned Kaiju, the CERT/CC‘s extension framework to Ghidra, the Nationwide Safety Company’s software program reverse engineering (SRE) instrument suite. Kaiju consists of many instruments to assist malware evaluation and reverse engineering. One of many extra advanced plugins included in Kaiju is a Satisfiability Modulo Theories (SMT) based mostly path evaluation instrument named GhiHorn. On this submit I delve deeper into GhiHorn to debate the way it works and the way it may be used to unravel path evaluation issues.
Path Discovering With out Supply Code
As famous above with the debugger detection instance, we imagine that many widespread challenges in malware evaluation and reverse engineering will be framed when it comes to discovering a path to a particular level in a program. On the binary degree we don’t have the additional info that you just usually have in supply code. For instance, as a substitute of programmer-named variables we should cause about CPU registers or reminiscence values. Moreover, optimizations carried out throughout compilation could lead to unique or convoluted code preparations that make it laborious to acknowledge widespread program options.
Recall the instance from our earlier weblog submit on binary path discovering, which is proven in Determine 1. Discovering a path from line 1 to the goal at line 20 (marked as “Goal line!”) is comparatively easy by visible inspection: we all know that native integer variable
x have to be set to
42 due to the situation at line 19, which may solely occur at line 12 when variable
y is the same as
2, which in flip depends upon the three integer enter parameters,
ok being set to
Determine 1: Instance program to show path discovering
Nevertheless, when this program is compiled, quite a few complexities emerge:
- Variable names and kinds are misplaced.
- Optimizations could lead to convoluted code buildings.
- There are lots of meeting directions which are mixed to type higher-level operations.
- Every CPU gives its personal instruction set structure (ISA) with particular calling conventions, registers, and operations.
From Pharos to Ghidra
We have now been growing instruments to evaluate path viability in our Pharos Binary Evaluation Framework for a while. Pharos’ path discovering instruments encode program management movement graphs as SMT assertions which are solved by the Z3 theorem prover. The unique encoding scheme utilized in Pharos is predicated on mutually unique Z3 assertions which are generated utilizing Pharos’ evaluation. Representing a program utilizing this scheme turned out to be cumbersome for a number of causes:
- The notation was not designed for human legibility, e.g., knowledge and management movement buildings and the transitions between them have been made utilizing customary Z3 assertions, which made encoding management movement unnatural and laborious for analysts to know.
- Frequent program phenomena have been difficult to mannequin, corresponding to calling capabilities, recursion, and complicated knowledge varieties.
- State variables grew to become derived symbolic values generated by Pharos that may very well be laborious to map again to significant program buildings.
Our different Pharos-based path evaluation instrument is ApiAnalyzer. ApiAnalyzer traverses program management movement graphs on the lookout for sequences of Utility Program Interface (API) operate calls that match prescribed behavioral signatures. The issue of figuring out a path in a program that satisfies particular constraints, corresponding to traversing a sequence of API operate calls, lends itself properly to path discovering. Our new work thus seeks to reframe this API evaluation when it comes to path discovering.
The Pharos method to program evaluation was initially designed to be easy, quick, and helpful for malware analysts. Many of those targets have been achieved on the expense of deeper analyses. It seems that superior path evaluation requires extra constancy that Pharos had problem offering.
Pharos can reply sure questions shortly, for instance: Is the worth used at X the identical as the worth used at Y? Since Pharos focuses on velocity, nevertheless, it trades off how advanced paths can turn into earlier than accuracy is now not assured. Consequently, reasoning about advanced program paths in Pharos is difficult. As famous above, we have been already exploring extra rigorous path evaluation utilizing SMT solvers in Pharos when Ghidra was launched. The Ghidra decompiler, with its wealthy programming API, opened new and thrilling methods to method path evaluation issues.
We have now created a brand new Ghidra-based instrument named GhiHorn (See Determine 2 under) to benefit from our current advances and to supply an extensible framework for binary-path evaluation issues. GhiHorn helps reverse engineers and malware analysts reply fascinating questions corresponding to
- Does a path exist to a specified level in a program (i.e. feasibility)?
- If there’s a path, what values must be assigned to program variables to succeed in it?
- If there’s not a possible path, why?
- Does the trail point out an fascinating or indicative habits?
Determine 2: Ghihorn Consumer Interface
We have named this new Kaiju instrument “GhiHorn” (GHI-dra HORN-ifier), in step with the custom of comparable source-code evaluation instruments utilizing Horn clauses, together with SeaHorn (a C language hornifier) and JayHorn (a Java hornifier). GhiHorn is created within the spirit of those different verification instruments, however it operates on Ghidra-generated knowledge buildings, particularly p-code. Ghidra’s decompiler and p-code language present sturdy details about program semantics for various architectures. Based on Ghidra’s p-code documentation:
A p-code operation is the analog of a machine instruction. All p-code operations have the identical primary format internally. All of them take a number of varnodes as enter and optionally produce a single output varnode.
Varnodes in Ghidra are knowledge parts represented as triples (reminiscence area, offset, and dimension) on which p-code operates. P-code and varnodes are essential to Ghidra’s decompilation course of. Ghidra generates each p-code and varnodes for every instruction in a program throughout preliminary program evaluation and disassembly. The p-code and varnodes initially generated are uncooked within the sense that they’re solely meant to symbolize the instruction semantics with little or no high-level info gleaned from greater order evaluation.
Throughout decompilation, pcode and varnodes are refined and related to summary native variables and source-code degree knowledge buildings. We time period this “excessive p-code” as a result of it’s certain to knowledge buildings in Ghidra that embody decompilation info, corresponding to
HighFunctions. Luckily, the construction of excessive p-code lends itself to SMT-based encoding.
With Ghidra offering the information and management movement buildings essential to symbolize a path, we want a technique to encode this system buildings for an SMT solver. Enter Horn clauses, that are a particular encoding for verification situations that may be constructed routinely from program management movement buildings. Researchers have continued to make advances in Horn clause solvers, and plenty of SMT solvers, together with Z3, now embody Horn solvers, which makes Horn clause encoding a viable resolution for path evaluation issues. We delve into among the particulars of how GhiHorn encodes p-code under.
GhiHorn encodes Ghidra p-code as SMT-Lib Horn clauses appropriate for the Z3 solver. Horn clauses are rule-like constraints expressed as implications. Horn clauses can symbolize transitions by means of a management movement graph of the logical type:
????? ???????? ∧???????????⇒?????? ????????Input Location ∧Constraints⇒Output Location
the place an enter location conjoined with constraints on state variables transitions to an output location. In program phrases, the enter location is the originating primary block, the constraints are situations over program variables, and the output location is a succeeding primary block. An instance of a conditional expression and the related Horn rule generated for it are proven in Desk 1 under. When management movement arrives at Line 1 and the variable
42, then management movement could progress to Line 2. As a result of this can be a choice level within the code, a second rule is required to transition from Line 1 to Line 3. Taken collectively, these guidelines mannequin the conditional construction proven within the following supply code.
Desk 1: Instance of a conditional expression and the related Horn rule generated for it
Ghidra gives a management movement graph from which the enter and output elements of the principles will be derived. Provided that p-code represents machine directions, you should utilize them to symbolize state transitions inside blocks. Translating p-code statements into Z3 expressions then seems to be comparatively straight ahead. For instance, here’s a advanced supply code assertion (proven in Desk 2) that was generated by Ghidra’s decompiler: ((param_1 >> 2) + 1U & 0xff) == 0x55. This instance is used as a result of it decomposes to a number of p-code operations, and these will be mapped to Z3 expressions. Notice that the supply code operations, corresponding to >> have analogs in each p-code (
INT_SRIGHT) and in Z3 SMT (
bvashr). For essentially the most half variables are represented as 64-bit bit vectors.
Notice that the varnodes current in p-code operations are changed with variables within the last Z3 expression (e.g.,
param_1). That is doable as a result of, throughout decomposition, Ghidra assocates varnodes with higher-level decompiler parts, corresponding to variables. Working on significant knowledge parts corresponding to variables gives for far more fascinating outcomes, that are mentioned under.
Desk 2: Decompilation, p-code, and Z3 expressions.
Because the desk above illustrates, the decompilation was generated by Ghidra. The p-code operations embody varnodes represented as triples: (reminiscence area, offset, dimension). The Z3 expressions are all equalities of the shape output = operations enter.
After every primary block is hornified it’s organized right into a set of Z3 guidelines. Every rule is an implication the place the antecedent is the supply primary block and the consequence is the sink block. A whole instance based mostly on the Ghidra decompilation proven in Desk 2 is proven in Determine 3. The rule captures the transition from the fundamental block at tackle 0x100003f60 to the fundamental block at tackle 0x100003fa2. The constraints seize situations on param_1 taken from the p-code operations that have to be true to allow the transition.
Determine 3: Horn rule generated by GhiHorn
A group of those guidelines is generated to symbolize a management movement graph for a program. Notice that variables are handed as state (i.e., arguments) to each the enter and output blocks. On this means, program state is up to date and maintained. Finally the encoded program is handed to Z3, and GhiHorn queries for a state (i.e., the tackle current within the encoding) to find out if a viable path is current.
Though this method to reasoning about program paths is intuitive it requires extra buildings to mannequin actual packages. For simplicity’s sake, all reminiscence operations are carried out on a single giant reminiscence array (named
Reminiscence in Determine 3) that’s managed by Z3. This method retains issues easy however will be restricted, and it’s in no way performant. In future variations of GhiHorn we plan to enhance reminiscence modeling by higher dealing with pointers and higher illustration of various reminiscence areas.
One other downside is the right way to deal with exterior dependencies, corresponding to imported API capabilities for which the code is probably not accessible. GhiHorn gives a functionality to construct a simulated API ecosystem by compiling library recordsdata that comprise easy implementations of widespread API capabilities. For instance, Determine 4 exhibits the supply code for the simulated API capabilities
CloseHandle(). On this instance, the implementation merely maintains an array meant to mannequin a file deal with desk, which makes it easy to trace which handles are open and closed, nothing extra.
Determine 4: Simulated API capabilities
In a future submit I’ll current two path evaluation instruments that we now have carried out on prime of the Ghihorn platform: Path Analyzer and API Analyzer.