Skip to content

Latest commit

 

History

History
346 lines (235 loc) · 13.6 KB

concepts.md

File metadata and controls

346 lines (235 loc) · 13.6 KB

Understanding BinDiff

This page provides some background information about how the BinDiff matching engine works.

It also describes the various matching strategies/algorithms that can be configured in the config file.

Table of Contents

Introduction

BinDiff works on the abstract structure of an executable, ignoring the concrete assembly-level instructions in the disassembly. Every function gets a signature, based on the structure of the (normalized) flow graph of the function. The signature consists of:

  • Number of basic blocks
  • Number of edges between basic blocks
  • Number of calls to sub-functions

Once the two sets of signatures (for the two executables) have been generated, initial matches are created. This works by selecting a subset of all functions in each executable that share a common characteristic. A match is created if a signature occurs once (and only once) in both examined subsets of signatures.

After this step, call graphs (graphs which contain information about the calls-to relations between functions) are used to generate more matches: If a match is known, the subsets of all functions called from a matched function are examined. These subsets are significantly smaller than the set of all functions, thus the propability for finding new unique matches is a lot higher. This process is repeated until no new matches can be found.

This means that after running the BinDiff algorithm, the result is a list of functions that were successfully associated with each other, as well as two lists of functions that could not be associated.

General Matching Strategy

BinDiff has a list of function attributes (see below) suitable for generating matches. It starts on a global level, considering all functions of the binary and calculates the first attribute for every function. There are several possible outcomes:

  1. An attribute is unique in both binaries. The function is matched.
  2. An attribute appears several times in both binaries - the matching is ambiguous. BinDiff proceeds with a "drill down" step, only considering the equivalent sets of functions for that attribute. Drilling down means trying the next best attribute until it either runs out of algorithms, matches functions uniquely or because the set dissolves because an attribute does not match any of its functions.
  3. An attribute does not have a match in the other binary. The function is kept in the unmatched set.

After the initial global matching step the parents (callers) and children (callees) of each new match are considered. BinDiff tries to match functions in the set of parents and children by performing the "drill down" step described above on each. Finally, BinDiff performs basic block matching for all newly matched functions and matches functions called from matched basic blocks(function: call reference matching). This concludes global matching for a single attribute. The whole process is restarted on the remaining unmatched functions using the next best attribute.

Function Matching

Function attributes are used in one of two ways. Either canonically per function or per edge. Edge matching tries to match edges (which represent calls in the call graph or jumps in the flow graphs) if source and target function attributes match. Thus edge matching is the stronger criterion, yielding better matches in general. However, since it is possible to have a lot of edges per vertex in a graph (this is especially true for call graphs, where the number of edges often grows super-linearly in the number of vertices), edge matching may potentially be very slow.

If you do encounter performance problems for certain binaries you should first try to disable edge matching based algorithms.

Function matching algorithms ordered roughly by resulting match quality:

Hash matching

Matches functions based on a hash of the original raw function bytes. Thus two functions matched by this algorithm should be identical or near-identical on the byte level.

Match quality: very good
Performance: very good

Name hash matching

Matches functions based on a hash of their names. Only real names are considered, names which have been auto-generated by the disassembler are not used. This is one of the few algorithms that can match imported functions, i.e. functions that do not have an actual body in the binary. False matches are highly unlikely.

Match quality: very good
Performance: very good

Edges flow graph MD index

Matches call graph edges based on their source and target function's MD indices. Thus calls between two structurally identical functions are matched.

Match quality: very good
Performance: medium

Edges call graph MD index

Matches call graph edges based on their callgraph MD index. This means the call graph leading to that particular call is structurally identical in both binaries. Match quality depends on how deep the callstack leading up to this edge is: the deeper the less likely is a false match.

Match quality: good
Performance: medium

MD index matching (flow graph MD index, top-down/bottom-up)

Matches functions based on their structure using the MD index. Since the MD index takes a topological graph ordering as one of its inputs, it can be parametrized by whether the graph vertices are sorted into levels following calls from the entry point (top-down) or callers from the exit points (bottom-up).

Match quality: good
Performance: very good

Prime signature matching

Matches functions based on their instruction prime products. Each mnemonic gets assigned a unique small prime number. These primes are multiplied for all instructions of the function. This yields a structurally invariant, instruction order independent product to be subsequently used as a matching attribute.

Match quality: good
Performance: very good

MD index matching (call graph MD index, top-down/bottom-up)

Matches functions based on their position in the call graph. The call graph leading up to that function must be structurally identical as viewed from the program entry point (top-down) or its exits (bottom-up).

Match quality: good
Performance: very good

Edges proximity MD index

Matches functions based on their local call graph neighborhoods. Calls and callees are only followed two levels deep as seen from the function in question.

Match quality: medium
Performance: poor

Relaxed MD index matching

Matches functions based on their relaxed MD indices. The MD index is calculated without taking topological order into account. This means only the in-edges and out-edges in the function's local neighborhood are considered.

Match quality: medium
Performance: medium

Address sequence

Matches functions in order based on their entry point addresses. This is a special matching step that is especially useful during drill downs. Since it would indiscriminately match all functions if not further constrained there are two additional requirements: first, the functions in question must already be equivalent according to the relaxed MD index and the flow graph MD index. And second, the two sets of equivalent functions in both binaries must be of equal size.

Match quality: poor
Performance: very good

String references

Matches functions based on their referenced string data. All strings referenced from the functions in question are put into a combined hash which is subsequently used as the matching attribute if at least one string is referenced. This is a good algorithm for matching error handling code which often has very little structure (and thus won't get matched by the stronger algorithms) but lots of references to error message strings.

Match quality: medium
Performance: very good

Loop count matching

Matches functions based on the number of loops. Only applied if at least one loop is present.

Match quality: poor
Performance: very good

Call sequence matching (exact/topology/sequence)

Special algorithm that is only used for functions with matched parents (callers). The point of the call at the call site is determined as a tuple: (topological basic block level, instruction number in the basic block, address).
The child function (callee) is matched if basic block level and instruction number match (exact), if only the basic block level match (topology) or simply ordered by call site address (sequence). This produces very weak matches in general, but may be a good strategy if the parent function was matched correctly. In that case it is not unlikely that it will call functions in the same order in both binaries.

Match quality: very poor
Performance: good

Basic Block Matching

Basic block matching on flow graph level is algorithmically very similar to function matching. Global attribute matching is followed by drill downs and by trying to match in the reduced set of parents/children of matched basic blocks.

Basic block matching algorithms ordered roughly by resulting match quality:

Edges prime product

Flow graph edges are matched if source and target basic block instruction prime products match. Thus both basic blocks contain identical instructions, potentially ordered differently.

Match quality: very good

Hash matching (4 instructions minimum)

Basic blocks are matched based on a binary hash of their raw bytes. Only used on basic blocks with at least 4 instructions.

Match quality: very good

Prime matching (4 instructions minimum)

Basic blocks are matched based on their instruction prime product. Only used on basic blocks with at least 4 instructions.

Match quality: very good

Call reference matching

Basic blocks are matched if they call at least one function and all called functions have been matched.

Match quality: very good

string reference matching

Basic blocks are matched if they reference at least one string and that string is the same in both binaries.

Match quality: very good

Edges MD index (top-down/bottom-up), MD index matching (top-down/bottom-up),and relaxed MD index matching

Basic blocks are matched based on their position in the flow graph.

Match quality: good

Prime matching (0 instructions minimum)

Works like the previous prime matching step but with the minimum number of instructions constraint lifted.

Match quality: medium

Edges Lengauer-Tarjan dominated

Matches the back edges of loops which have been determined using the Lengauer-Tarjan algorithm.

Match quality: medium

Loop entry matching

Matches basic blocks that are loop anchors, i.e. targets of a back edge.

Match quality: low

Self loop matching

Matches basic blocks that have self loops.

Match quality: low

Entry point/exit point matching

Matches the entry/exit point basic blocks. The entry point is uniquely identified by the function and usually has an indegree of 0. Exit points are vertices with outdegree 0.

Match quality: low

Instruction count matching, jump sequence matching

Special matching steps that are only applied to basic blocks that are structurally equivalent based on their MD indices. Basic blocks are matched according to their number of instructions or simply ordered by address.

Match quality: very low

Propagation (size 1)

Special matching step of last resort. Single unmatched parents/children of matched basic blocks are matched - with no regard of their content.

Match quality: very low

Confidence/Similarity

The confidence value displayed by BinDiff is the average algorithm confidence (match quality) used to find a particular match weighted by a sigmoid squashing function. The values aren't simply averaged because few single weak matches in an otherwise perfectly matched function/binary shouldn't drag the confidence down too much. Analogously, even a few strong matches will not "rescue" a binary pair matched primarily by address sequence and similarly weak algorithms.

The similarity value for a function is a weighted sum taking the following factors into account:

  1. Weight ~25%: quota of matched flow graph edges to total edges
  2. Weight ~15%: quota of matched basic blocks to total basic blocks
  3. Weight ~10%: quota of matched instructions to total instructions
  4. Weight ~50%: difference in flow graph MD index

The final similarity value is multiplied by confidence - even a seemingly good match is not trustworthy if produced by weak algorithms.

The similarity value for the whole binary is a weighted sum taking the following factors into account:

  1. Weight ~35%: quota of matched flow graph edges to total edges
  2. Weight ~25%: quota of matched basic blocks to total basic blocks
  3. Weight ~10%: quota of matched functions to total functions
  4. Weight ~10%: quota of matched instructions to total instructions
  5. Weight ~20%: difference in call graph MD index

This is, again, multiplied by confidence. Only non-library functions are considered for the counts. This is to avoid inflating the similarity of binaries that simply use the same runtime library but are otherwise completely dissimilar.

Further reading

The original papers outlining the general ideas behind BinDiff:

  • Thomas Dullien and Rolf Rolles. Graph-Based Comparison of Executable Objects. bindiffsstic05-1.pdf. SSTIC ’05, Symposium sur la Sécurité des Technologies de l’Information et des Communications. 2005.
  • Halvar Flake. Structural Comparison of Executable Objects. dimva_paper2.pdf. pp 161-173. Detection of Intrusions and Malware & Vulnerability Assessment. 2004.3-88579-375-X.