Effective Techniques for Partial Evaluation-Based Program Debloating
Prevalent software engineering practices such as code reuse and one-size-fits-all design have contributed to a significant increase in the size and complexity of software. In practice, a typical deployment exercises only a subset of the implemented features. This situation leads to software bloat, which wastes resources throughout the memory hierarchy and degrades performance and reliability. It also increases the attack surface, since a larger feature count raises the probability of software bugs and, in turn, security weaknesses.
Many state-of-the-art tools are available to remove bloated features from software. Some require users to write detailed high-level specifications, such as test cases describing the features that must be preserved in the specialized version of the software; this procedure can be error-prone and burdensome. Others take configuration information from command-line arguments, propagate it as constants through the code, and remove code that cannot execute given those constants; these rely on static analysis and partial evaluation.
We argue that the context-insensitive constant propagation used in state-of-the-art tools is ineffective for dead code elimination, and we propose context-sensitive interprocedural constant propagation instead. We reason that context sensitivity is more effective than context insensitivity at pruning the parts of the code that are unused in an actual deployment. Because context-sensitive constant propagation is slow and can lead to excessive function cloning, we introduce sparse constant propagation, which runs constant propagation only for variables that carry configuration information. We show that this outperforms constant propagation over all program variables, providing higher code size reductions.
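The idea above can be illustrated with a minimal sketch over a toy program representation: only configuration-tainted variables are tracked (the "sparse" part), and each constant calling context receives its own specialized clone (the "context-sensitive" part). All names and the statement encoding are illustrative assumptions, not Trimmer's actual API or LLVM IR.

```python
# Sparse, context-sensitive constant propagation: a toy sketch.
# Statements are tuples; ("if-eq", var, value, then_ops, else_ops) is a branch.

def specialize(func_body, const_env):
    """Fold branches whose condition involves a tracked constant; drop dead arms."""
    kept = []
    for stmt in func_body:
        if stmt[0] == "if-eq":
            _, var, value, then_ops, else_ops = stmt
            if var in const_env:  # sparse: only config-tainted variables are tracked
                kept.extend(then_ops if const_env[var] == value else else_ops)
            else:
                kept.append(stmt)  # unknown condition: keep both arms
        else:
            kept.append(stmt)
    return kept

def clone_per_context(callee_body, call_sites):
    """Context sensitivity: one specialized clone per constant calling context."""
    return {ctx: specialize(callee_body, consts) for ctx, consts in call_sites}

# A callee branching on a configuration variable, called from two contexts.
body = [
    ("if-eq", "mode", "fast", [("op", "fast_path")], [("op", "slow_path")]),
    ("op", "common"),
]
clones = clone_per_context(body, [("caller_a", {"mode": "fast"}),
                                  ("caller_b", {"mode": "slow"})])
```

Each clone keeps only the branch arm reachable in its context, which is exactly the extra dead code a context-insensitive analysis would have to retain.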
Secondly, we observed that many programs use a static file to describe their configuration. State-of-the-art tools either ignore this, relying only on command-line arguments as constants to propagate, or handle the static configuration file through a manual method. We propose automated File I/O Specialization: we lift the static configuration information in the file into constants in the code. This simplifies the file-parsing code and removes code that cannot execute given those constants.
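A minimal sketch of this lifting step, assuming a simple `key = value` configuration format for illustration (the actual tool operates on LLVM IR, not Python source):

```python
# File I/O Specialization sketch: lift a static config file into constants,
# then prune feature code guarded by now-constant settings.

def lift_config(text):
    """Parse a static 'key = value' config file into a constant environment."""
    consts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if "=" in line:
            key, value = (s.strip() for s in line.split("=", 1))
            consts[key] = value
    return consts

def prune_features(features, consts):
    """Keep only the feature handlers enabled by the lifted constants."""
    return [handler for handler, guard_key in features
            if consts.get(guard_key) == "on"]

conf = lift_config("logging = off\ncompression = on  # static setting\n")
features = [("handle_logging", "logging"), ("handle_compression", "compression")]
kept = prune_features(features, conf)
```

Once the file's contents are treated as constants, both the generic parsing loop and the disabled feature handlers become dead code and can be removed.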
Thirdly, we propose deep learning methods to automatically detect functions whose specialization yields size reduction benefits. We present the first attempt at pre-training a BERT model on the LLVM IR representation and fine-tuning it to predict which functions should be specialized. To pre-train BERT, we created a corpus of about 12.7 million lines of code in LLVM IR. Our idea is to generate a training dataset of two classes of functions:
(1) Specializable functions: Functions whose specialization leads to size reduction.
(2) Unspecializable functions: Functions whose specialization yields no size reduction benefit.
We train our model on this dataset to predict specializable and unspecializable functions in future applications.
We implemented sparse context sensitivity and File I/O Specialization on top of Trimmer, a configuration-driven code debloating tool, and evaluated our techniques on 20 common Linux utilities, including six benchmarks that require static configuration files. Our evaluation shows that, on average, sparse context sensitivity provides higher code size reductions than both context insensitivity and full context sensitivity, with reasonable analysis time, improved performance, and removal of common security vulnerabilities.
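The two-class labeling scheme described above can be sketched as follows; `specialize` and `size` are stand-ins for specializing real LLVM IR functions and measuring their compiled size, and the toy definitions are assumptions for illustration only:

```python
# Dataset labeling sketch: a function is "specializable" if specializing it
# shrinks its code size, "unspecializable" otherwise.

def label_functions(functions, specialize, size):
    """Return (name, label) pairs by comparing pre- and post-specialization size."""
    dataset = []
    for name, body in functions:
        spec_body = specialize(body)
        label = ("specializable" if size(spec_body) < size(body)
                 else "unspecializable")
        dataset.append((name, label))
    return dataset

# Toy stand-ins: size = instruction count; specialization folds away
# instructions marked dead by constant propagation.
toy_size = len
toy_spec = lambda body: [op for op in body if not op.startswith("dead_")]
funcs = [("f", ["cmp", "dead_branch", "ret"]), ("g", ["add", "ret"])]
dataset = label_functions(funcs, toy_spec, toy_size)
```

A fine-tuned model trained on such labels can then skip specializing functions that would yield no size benefit, avoiding wasted cloning.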