My Reading: Software Defense

HexType: Efficient Detection of Type Confusion Errors for C++

Source Code:

Summary: In a statically typed programming language, typecasting is a common phenomenon. Under the object-oriented programming paradigm, this feature turns into a dangerous attack surface. Casting a derived-class object to a parent-class pointer (upcast) is usually safe, since the parent class is a subset of the derived class. On the other hand, casting a parent-class object to a derived-class pointer (downcast) can raise serious issues (if the derived class has members the parent class lacks). Moreover, these casts are often indirect (an object is passed to a function as a base-class pointer). The research work attempts to resolve the issue with runtime checking (similar in spirit to dynamic_cast, which is expensive and limited to classes with virtual functions). It expands coverage with a more efficient runtime check built on their own data structure.

Design: The design can be divided into three parts. 1) Type relationship information: this is available to the compiler at compile time to verify static casts, but it has to be embedded in the binary for use at runtime. 2) Object type tracing: whenever an object is allocated, a table records the object along with its original class type; to do this, they instrument every allocation site. The table combines a hash map with a per-slot RB-tree: the hash-map check serves as the fast path, while an RB-tree traversal serves as the slow path. The hash map is updated frequently to reflect recently verified objects. 3) Type-casting verification: at every cast site, instrumentation requests runtime verification of the cast object. The check collects runtime information and verifies both (1) that the casting path is allowed according to the type hierarchy and (2) the original type of the object being cast. They extend this coverage to dynamic_cast as well by replacing it with their own check. They also optimize performance by excluding provably safe objects from tracking and provably safe casts from verification (detected through static analysis).
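
A rough sketch of the two-level lookup the design describes, with a direct-mapped cache as the fast path and `std::map` (red-black-tree backed in common implementations) standing in for the slow path. HexType's real layout, hashing, and eviction differ; this only shows the fast/slow split:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Maps an object's address to a type id, as recorded at allocation sites.
class TypeTable {
    struct Entry { uintptr_t addr; int type; };
    static const size_t kSlots = 1024;
    Entry cache_[kSlots];               // fast path: direct-mapped cache
    std::map<uintptr_t, int> tree_;     // slow path: tree traversal

    static size_t slot(uintptr_t a) { return (a >> 4) % kSlots; }
public:
    TypeTable() { for (size_t i = 0; i < kSlots; i++) cache_[i] = {0, -1}; }

    void insert(uintptr_t addr, int type) {      // called when allocating
        tree_[addr] = type;
        cache_[slot(addr)] = {addr, type};       // refresh fast path
    }
    int lookup(uintptr_t addr) {                 // called at cast sites
        const Entry& e = cache_[slot(addr)];
        if (e.addr == addr) return e.type;       // fast-path hit
        auto it = tree_.find(addr);              // slow-path traversal
        if (it == tree_.end()) return -1;
        cache_[slot(addr)] = {addr, it->second}; // promote for next time
        return it->second;
    }
};
```

The point of the split is that hot objects are verified with one hashed array access, and the tree is consulted only on a miss.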

Advantage: The design is simple enough to be scalable. They also detected a couple of type confusion bugs with it. The performance overhead is better than CaVer but worse than TypeSan. They also achieve better code coverage.

Disadvantage: The static-analysis-based optimization is not described completely, and traditional analyses are not reliable. The handling of reinterpret_cast is also dubious.

Discussion: Overall, type confusion is an interesting issue to cover, and detection through runtime verification is promising. C can also have similar bugs; those should all be covered.

My Reading: Memory Defense

Stack Bounds Protection with Low Fat Pointers

Source Code:

Summary: The research work is an extension of the authors' earlier work (heap bounds protection with low-fat pointers). The concept of low-fat pointers originated in that previous paper, which they also summarize in good detail here. The basic idea of low-fat pointers: use the pointer value itself to compute its object's bounds, instead of widening the pointer or creating shadow memory. Additionally, the memory space is split into different virtual regions by object size to improve performance. However, heap memory, managed explicitly by the allocator, is easier to handle than the program-maintained stack, for which they propose a stack allocator that handles all stack memory operations.

Design: In this design, when a pointer needs its object's bounds, it consults a lookup table, using the pointer's address divided by 32GB (the region size) as the index to find the object size for that region. Then, rounding the pointer address down by the object size yields the base address of the object. With both base address and object size in hand, the bound can be validated every time the pointer accesses a different location in the object. The bounds checks run over instrumented binaries. For heap allocation (e.g., malloc), the allocator can be modified to place objects into size-specific regions. Stack objects, however, all normally live in a single region dedicated to the stack; this keeps physical memory non-fragmented, which simplifies memory management. For stack objects it is important to handle memory efficiently: fitting them into the same design as the heap without hurting regular stack operation is critical, and it must be fast. For the table lookup they use the lzcnt instruction; for alignment they use a masking operation (another lookup table); and they allocate stack objects into size-partitioned 32GB regions like the heap, now with separate regions for heap and stack. Everything is fine except that physical memory is now more fragmented due to the redesigned stack allocator, so they propose memory aliasing: every virtual stack region is mapped to the same physical memory. They do apply further optimizations deciding where to check bounds and where not, but those are not a major contribution.
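
The bounds computation above can be sketched as follows, with a tiny region size and a hard-coded size table instead of the paper's 32GB regions and lzcnt/mask tricks (all values here are illustrative only):

```cpp
#include <cassert>
#include <cstdint>

// Toy parameters: the real scheme uses 32GB regions and power-of-two
// masking; 64KB regions and four size classes are just for illustration.
const uintptr_t kRegionSize = 1u << 16;          // region size
const uintptr_t kSizes[] = {16, 32, 64, 128};    // allocation size per region

uintptr_t alloc_size(uintptr_t p) {              // table lookup by region index
    return kSizes[(p / kRegionSize) % 4];
}
uintptr_t base_of(uintptr_t p) {                 // round down to object start
    uintptr_t s = alloc_size(p);
    return (p / s) * s;
}
bool in_bounds(uintptr_t p, uintptr_t access_end) {
    // An access is OK if it ends within [base, base + size).
    return access_end <= base_of(p) + alloc_size(p);
}
```

Because both size and base are pure arithmetic on the pointer value, no per-object metadata or fat pointer is needed.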

Advantage: Low-fat pointers are memory efficient, though when it comes to performance, the design still has to sacrifice a little more memory than strictly necessary.

Disadvantage: First of all, it is now easier to predict where an object is located in physical memory knowing only the object size. Although pointer integrity is not part of the bounds check, this can cause greater disaster if left unchecked. The memory-aliasing design for virtual stack regions is not convincing; its description is obscure.

Discussion: The design and evaluation do not offer much hope for a practical implementation of low-fat pointers for stack objects. Overall, I conclude that low-fat pointers cannot supplant shadow-memory designs.

My Reading: Exploit Generation

It is impossible to write a complicated system without hidden vulnerabilities residing in it. While researchers continuously work on tools to detect these hidden vulnerabilities, attackers keep coming up with new kinds. This race seems never-ending, so researchers also concentrate on designing defense systems that can at least reduce the impact of a vulnerability. These defense systems restrict an attacker from developing an effective exploit against the vulnerable program. This opens another door for attack researchers: tools that generate exploits automatically.

Block Oriented Programming: Automating Data-Only Attacks

Source Code:

Summary: Vulnerable software with active defenses (e.g., Control-Flow Integrity, shadow stacks, address space randomization) is hard to exploit. Control-Flow Integrity (CFI) restricts execution to valid control flows, although because of imprecise control-flow graphs (CFGs), coarse-grained CFI allows over-approximated control transfers. This leaves the exploit door open for an attacker to steer execution to a desired state (e.g., spawning a shell, leaking information). In this research work, the authors propose a high-level language for writing Turing-complete exploits and a new approach that generates exploits using basic blocks as gadgets while keeping their control flow on valid paths in the CFG. This approach can defeat a shadow stack, which ROP gadgets cannot.

Design: When a vulnerable system is built with CFI and a shadow stack, it is neither possible to target an arbitrary location in the program nor possible to return anywhere. The major contribution here is that instead of using ROP gadgets, the attacker uses whole basic blocks as gadgets, so the program follows only valid control transfers while he can still achieve his goal. But first the attacker has to express what he is hoping to exploit. So the authors propose a language for writing exploit payloads (SPL) that is architecture independent and Turing complete. An attacker can use abstract registers as variables and a subset of C library calls (e.g., execve). An SPL payload can contain multiple statements, and for each of them, block-oriented programming (BOP) finds a set of basic blocks that satisfies that statement. For example, if a statement expects a register to hold the value 7, it finds basic blocks in the vulnerable program that end with some register containing 7. These sets of basic blocks are the candidate blocks. Next, the system stitches together the per-statement block sets so that no block clobbers another block's purpose. These are the functional blocks, and now it is time to identify their dispatcher blocks. Dispatcher blocks are non-clobbering blocks that connect one functional block to the next. Searching for dispatcher blocks is an NP-hard problem, and even approximation algorithms will not work (because of the large code space of real-world programs). So they propose a backtracking search with a minimum threshold on search depth. This may overlook many possible dispatcher paths, but it is important for scalability. With the functional blocks and their dispatcher blocks, they obtain the delta graph: a context-sensitive shortest path leading from a vulnerable entry point to the exploit state.
Although the exploit is valid in the presence of CFI and a shadow stack, it is still important to validate it by solving its constraints across the control transfers. A concolic execution engine is used to prove that the chosen subgraph of the CFG is actually executable for the exploit. Now, if an attacker has found a vulnerability in a system, he can use this exploit-generation system to figure out which exploit could be useful.

Advantage: Their exploit-generation technique is a forward approach, which makes it easy to follow. Block-oriented gadgets are a realistic expectation (ROP gadgets are an insane idea against a hardened system).

Disadvantage: Too many thresholds limit its power. The SPL idea is not really necessary; it is still essentially C code. The evaluation shows the system mostly fails to generate exploits for a major breakthrough like spawning a shell.

Discussion: Using basic blocks and the control-flow graph as the base for finding a subgraph that is itself valid is a good concept. But due to the heuristics, expectations must be lowered a lot. A forward approach also limits discovering what else a vulnerable system might be able to deliver.

My Reading: Software Attack

k-hunt: Pinpointing Insecure Cryptographic Keys from Execution Traces

Source Code:

Summary: It would be useful for attackers to identify the memory locations where an application stores its cryptographic keys. It is even more useful to run taint analysis for various purposes (e.g., identifying whether a key is insecure). This research uses an online dynamic analysis to identify cryptographic key buffers and verify whether they are secure.

Design: The key challenges in identifying a cryptographic key are: 1) how to identify cryptographic operations without signatures (since proprietary algorithms must be covered too), and 2) how to identify the memory buffer holding the key. Additionally, detecting an insecure key in a complex program requires an optimized taint mechanism. For each challenge they present their key insights. For case 1, they use three heuristics: identify basic blocks that involve arithmetic operations, are repeatedly executed, and whose input-output relationship exhibits randomness. For case 2, they use two additional heuristics: the key is usually smaller than the data buffer, and keys are usually derived, with randomness, just before the cryptographic operation executes. Once the key is identified, a second dynamic-analysis run detects insecure keys. A key is tagged insecure if: 1) it is generated with inadequate randomness, 2) in a client-server setting only one side contributes to its derivation, or 3) it is not cleaned up immediately after use.
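
The randomness heuristic above can be caricatured with a simple distinct-byte ratio. k-hunt's actual randomness test is more involved; this sketch (and its 50% threshold) only conveys the flavor:

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Crude stand-in for a randomness heuristic: a buffer whose bytes are
// mostly distinct looks more "key-like" than a repetitive one. The real
// check in the paper is more sophisticated; the threshold is arbitrary.
bool looks_random(const unsigned char* buf, size_t n) {
    std::set<unsigned char> distinct(buf, buf + n);
    return n > 0 && distinct.size() * 2 > n;   // >50% distinct bytes
}
```

A zero-filled or repetitive buffer fails the test, while a fresh 16-byte key with mostly unique bytes passes.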

Advantage: The design is simple and modularized. It depends on only one specific technology: dynamic binary analysis. The evaluation is convincing, and the cause is worthwhile.

Disadvantage: The overhead of online verification is high. There is still the possibility of a communication cut. The rerun for insecure-key detection could fail because of ASLR.

Discussion: Overall, the research makes a significant contribution. More insecure-key cases could be defined in the future and implemented on top of it.

My Reading: Control Flow Integrity

Due to memory vulnerabilities in popular programming languages (C, C++), hackers can overwrite other memory and take control of a vulnerable system. For an attacker, control-flow hijacking is one of the first things to check for once a vulnerability is found. Control-flow hijacking means diverting the regular program execution path to the attacker's desired target. As an example, an attacker can spawn a shell by redirecting execution to the system function with appropriate arguments staged in memory. If the vulnerable program runs as root, the attacker does too. Control-flow integrity keeps an active eye on program execution so the program cannot depart from its expected paths. Two things are most important in CFI: 1) a complete control-flow graph, and 2) a strong enforcement policy. If either loses a little accuracy, it harms the other. Both features depend on each other, so every CFI system is expected to explain them properly.

Source Code:

Summary: The project pursues an ambitious goal: based on execution history, enforce a CFI policy that allows only one valid target per indirect jump/call. For decades, researchers have tried to design strict enforcement with a strong CFI policy, but performance overhead and complex real-world programs fail them every time; this project is no exception. They failed to evaluate SPEC benchmarks like gcc, xalancbmk, and omnetpp, which are the most important ones for evaluating a CFI defense given their complexity (they also have the most indirect calls). The work depends heavily on Intel PT hardware support, which tends to lose information when events are frequent. Intel PT is designed for offline debugging, so there is no real way to fix this issue for online analysis.

Design: The project can be divided into three parts: 1) constraining-data identification, 2) runtime data collection, and 3) online analysis. Constraining data is a program variable that is not a function pointer itself but has a significant impact on where a function pointer will target. For example, if there is an array of function pointers and variable i is used as the index for an indirect call through that array, then i is constraining data. They use compile-time analysis to identify, first, instructions involved in function-pointer computation and, second, which operands of those instructions are variables. Once the constraining data is known, it must be collected at runtime (2), for which they use compile-time instrumentation. Since they plan to use Intel PT to track execution history, they have to find a way to force the instrumentation to inject arbitrary values into the PT trace. Their trick: call a dedicated write function with the constraining-data value as the parameter, and inside it issue multiple indirect calls to return-only bodies whose targets are fixed by chunks of the passed value. To reduce the number of trace packets, they do the same for sensitive basic blocks, assigning each a unique id. With all of this, they can skip checking TNT packets (conditional control transfers) and check only TIP packets (indirect control transfers). For the online analysis (3), they run it in a separate process, parallel to the original program. A read method reconstructs the arbitrary data from the Intel PT packet trace and validates it against a mapping.
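
The function-pointer-array example reads roughly like this in C++ (hypothetical code; `i` is the constraining data, since it fixes the indirect call's target without being a function pointer itself):

```cpp
#include <cassert>

int add_one(int x) { return x + 1; }
int negate(int x)  { return -x; }

// Table of possible targets for the indirect call below.
int (*handlers[])(int) = { add_one, negate };

int dispatch(int i, int arg) {
    // `i` is constraining data: it is not a function pointer, but its
    // value alone determines the target of this indirect call. The
    // design above records i at runtime (via the PT trace) so the
    // checker can narrow the valid target set down to exactly one.
    return handlers[i](arg);
}
```

A conventional CFI policy would allow both `add_one` and `negate` here; recording `i` lets the validator insist on the single correct one.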

Advantage: If you could perfectly identify every piece of constraining data for every sensitive indirect control transfer, and if you had perfectly working hardware to trace the history, and if the separate validation did not harm the security guarantee, then it would be a very good system.

Disadvantage: Unfortunately, none of these works perfectly. Static analysis on complex real-world applications will never succeed with such a simple algorithm; there will always be special cases, and the analysis mechanism will have to be updated time and again. Next, the imperfect hardware is eventually reported by the authors themselves. And everyone knows you need to check before you jump, not after. Pausing the program at sensitive syscall points can easily enable a DoS attack by building up a long pending list of validation checks from frequently executed indirect control transfers. As for the performance claim, they report 10% average overhead, but 47% for perlbench and 24% for h264ref, while not evaluating the similar benchmarks gcc, omnetpp, and xalancbmk. Even assuming just 25% overhead for each of those, the overall average rises to at least 15%. They evaluated nginx and vsftpd but did not mention which test suites they used. Moreover, their system creates more indirect calls, possibly even more than exist in the target program.

Discussion: To emphasize their security improvement, they present a case study from the h264ref SPEC benchmark. They show a structure with a member function pointer; this structure is later used to create an array of size 1200. From that, they simply assume there are 1200 valid targets for that member function pointer (a unique one per element), which they reduce to 1. But in truth that member ever takes only three valid functions across the entire source code. The project is too ambitious and hence not practical.

PT-CFI: Transparent Backward-Edge Control Flow Violation Detection Using Intel Processor Trace

Source Code: N/A

Summary: Intel PT is Intel hardware support for offline debugging. It captures compressed data packets for indirect control-flow transfers, conditional branches taken/not taken, etc. PT-CFI attempts to use this hardware feature to protect backward edges by enforcing a shadow-stack-style protection. It leaves forward indirect control transfers out of scope due to the unavailability of a complete points-to analysis to generate the required CFI policy. It also limits verification to critical syscalls to rein in the performance overhead.

Design: PT-CFI has four components: 1) a packet parser generates the TIP sequence and feeds it to 2) TIP graph matching (built from training). On a mismatch, 3) deep inspection decodes the trace and constructs a shadow stack; if it matches the shadow stack, the new TIP sequence is added to the TIP graph; otherwise 4) the syscall hooker is informed to terminate the program.

Advantage: The paper describes the background knowledge on Intel PT well.

Disadvantage: The paper fails to clarify crucial parts. For example, in deep inspection they construct a shadow stack from Intel PT traces and then claim to match it against a shadow stack (which shadow stack?). They leave forward indirect control transfers out of consideration. The performance overhead is still very high (e.g., 67% on SPEC gcc).

Discussion: Intel PT was introduced for offline analysis; using it for online validation is not a good idea overall. The training-based CFI policy generation is not well described. Above all, the deep inspection is hugely misleading.

GRIFFIN: Guarding Control Flows Using Intel Processor Trace

Source Code:

Summary: The authors attempt only to demonstrate the performance-overhead optimization of using Intel PT for online verification. They claim to verify enforcement policies for both backward and forward indirect control transfers at different levels of strictness, while completely discarding any discussion of how they obtain the policies they verify against.

Design: To efficiently extract information from the Intel PT trace log, they propose a fast lookup design. The basic-block to dereferenced-pointer lookup is basically a cache-like mechanism. They hardly describe anything about their coarse-grained and fine-grained policies, but since they only trace TIP packets, they can at most use policies based on indirect control-transfer information. They largely omit the needed discussion of the stateful policy for forward control transfers, mentioning only that they use an approach similar to PathArmor.

Advantage: They cover advanced topics like handling context switches and process forks, while omitting much-needed basic discussion.

Disadvantage: The writing is a complete game of hide and seek. It misses much of the basic discussion and says nothing about how the policies are obtained. Moreover, they run enforcement in a parallel process and enforce strict restrictions only at sensitive syscall entries. All of this shrinks the whole idea of control-flow integrity, so whatever performance overhead they achieve is not meaningful.

Discussion: Hardware meant for offline debugging should not be pressed into online verification; it is just a waste of time.

Transparent and Efficient CFI Enforcement with Intel Processor Trace

Source Code: N/A

Summary: The project aims to protect indirect control transfers via a coarse-grained indirect control-flow graph using Intel PT, only at security-sensitive system call points. The system uses a hybrid fast/slow check for efficiency. The fast check does not require decoding the trace and is only available if the dynamically fuzzed training CFG contains the exact trace. Otherwise, the slow path has to decode the trace and verify it against a static-analysis-based CFG (TypeArmor).

Design: The design consists of four tasks. First, there is a static-analysis-based CFG. Then CFG edges are ranked based on the dynamic fuzzing training CFG. A system-call interceptor in the kernel flags security-sensitive system calls. The flow checker collects the trace and, based on it, matches via the fast path or the slow path. The CFG edges come only from indirect control transfers, since the runtime uses only Intel PT TIP packets.

Advantage: The performance overhead is acceptable, but the security guarantee is very weak. The paper does try to describe every step.

Disadvantage: The design is very thin. First, they only monitor security-sensitive indirect control-flow transfers. The fast path is only available for edges present in the fuzzed CFG (so it depends on code coverage). The slow path uses the static CFG, which over-approximates. And the CFG itself consists only of indirect control-transfer edges, since Intel PT is used only for TIP packets.

Discussion: This is the third paper that tries to use Intel PT, hardware meant for offline debugging, for online verification. Limiting monitoring to security-sensitive system calls once again shows that this hardware is not a good choice for the purpose.

My Reading: Kernel Fuzzer

The Linux kernel is a complicated code base written in the C programming language, and as such it is vulnerable to memory corruption. Generally, the kernel is believed to be trusted. The only way the kernel acts on behalf of the user is through system calls from user programs. So researchers are focusing on inventing new fuzzing techniques that can generate random syscall sequences (with correct arguments). A number of vulnerabilities such as race conditions, data races, use-after-free, etc. can pop up with such techniques.

RAZZER: Finding Kernel Race Bugs through Fuzzing

Source Code: (expected to be released)

Summary: Existing kernel fuzzers can be divided into two categories: 1) those that generate different combinations of syscalls under different multi-threaded setups, and 2) those that impose different time interleavings on kernel threads. A system combining both features was bound to appear, and Razzer is it. They also add one more layer on top: a static analysis pass. They found 30 new races in the kernel, 16 of which are confirmed.

Design: The system consists of a static analyzer, a single-thread fuzzer, and a multi-thread fuzzer. The static analyzer uses points-to analysis to select pairs of memory-access instructions as candidate race pairs (over-approximate). The single-thread fuzzer has a generator and an executor: the generator produces a sequence of random syscalls, and the executor observes whether a candidate race pair gets executed. If so, the syscall sequence is later pushed to the multi-thread fuzzer, which also has two components: its generator splits the syscalls across different threads, and its executor observes whether the selected candidate race-pair instructions access the same memory. If they do, the candidate is tagged as a true race pair.
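
The candidate-race-pair idea can be sketched as two handlers aliasing the same object (hypothetical kernel-style code; Razzer's points-to analysis flags the pair statically, and the runtime stages then check whether the two accesses really collide):

```cpp
#include <cassert>

// Hypothetical kernel-style shared state: two "syscall handlers"
// reach the same object through aliasing pointers. A points-to
// analysis would flag (write in handler_a, read in handler_b) as a
// candidate race pair: both may touch *shared with no lock held,
// and at least one access is a write.
struct Inode { int refcount; };

static Inode  g_inode{1};
static Inode* shared = &g_inode;   // both handlers alias this object

void handler_a() { shared->refcount = 0; }      // write access
int  handler_b() { return shared->refcount; }   // read access
```

The single-thread fuzzer's job is to find a syscall sequence that executes both program points at all; only then does the multi-thread fuzzer try to make them overlap in time.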

Advantage: Time interleaving across threads is an important issue for detecting data races. Checking random syscall sequences is equally important for covering most of the kernel code and thereby surfacing various race conditions. Combining race-condition and data-race detection techniques helps find more data races. The system design is simple and hence practical.

Disadvantage: Instead of taking random syscall sequences from syzkaller, they could use MoonShine to generate logical syscall sequences. That might reduce the overhead and increase the vulnerability findings. The multi-thread fuzzer's generator distributes syscalls across threads randomly; a deterministic approach would be useful here, if it were possible to detect which syscall triggers execution of an instruction in the candidate race pair.

Discussion: When they try to match two instructions accessing the same memory concurrently, with at least one of them writing to it, they completely ignore one major factor: exploitability. Their 16 of 30 true race pairs being confirmed indicates that not every true race pair is exploitable. I believe a more accurate implementation of this design could find more true race pairs, and could well see more than 50% of them turn out to be non-exploitable.

MoonShine: Optimizing OS Fuzzer Seed Selection with Trace Distillation

Source Code: N/A

Summary: Syzkaller is one of the most popular kernel fuzzers. It generates sequences of random system calls. Due to the randomness, most of them are unrealistic. They lose efficiency because they do not consider dependencies (both implicit and explicit) among system calls. Some other kernel fuzzers collect syscall sequences from a large number of benchmark programs and feed them into the fuzzer. In this project, the authors use a distillation algorithm that identifies dependent syscalls in real-world application traces and feeds these syscall blocks into the fuzzer, randomizing at the block level.

Design: The project basically extends syzkaller, replacing its random syscall-sequence generator with a random syscall-block-sequence generator. In the first step, they trace real-world applications and observe their explicit and implicit syscall dependencies. One example is a file read: a file must first be opened, then read, and usually it is also closed. Thinking deeper, before a file read the application also usually allocates some memory as the buffer for the read data. From the kernel's perspective, these dependencies can be explicit or implicit. An explicit dependency is when the return value of one syscall is used as an argument to another. An implicit dependency is when two different syscalls use the same data structure in the kernel. While explicit dependencies are straightforward to resolve, implicit dependencies require a kernel-level static analysis that determines whether two different syscalls may use the same global kernel variable, one writing and the other reading.
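
The explicit file-read dependency from the example reads like this (the path below is made up for illustration; the point is that `open`'s return value flows into `read` and `close`, which is exactly what MoonShine detects as an explicit dependency):

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// A dependent syscall block: the fd returned by open() is an explicit
// dependency of read() and close(), so a distillation pass keeps these
// calls together as one block instead of shuffling them independently.
ssize_t read_block(const char* path, char* buf, size_t n) {
    int fd = open(path, O_RDONLY);      // producer of fd
    if (fd < 0) return -1;
    ssize_t got = read(fd, buf, n);     // consumer of fd
    close(fd);                          // consumer of fd
    return got;
}
```

An implicit dependency, by contrast, would not be visible in the arguments at all (e.g., two syscalls touching the same kernel global), which is why it needs the kernel-level static analysis.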

Advantage: The motivation is realistic and the design is simple, hence the project is significant and practical. They found 17 new vulnerabilities, 7 of which could also have been found through the explicit-dependency check alone.

Disadvantage: Like every other fuzzer, the system does not try to prove exploitability. But since its goal is to generate more logical fuzzing inputs than pure randomness, the probability of exploitability is higher. Of the 17 findings, 9 are confirmed as already fixed.

Discussion: The motivation is genuine. Although the workload is not heavy, it is the impact that makes this significant work.