Projects in AVSV

Automatic Vulnerability Scanner for Mobile Applications

Within the framework of this project, we are developing the next-generation vulnerability scanner VUSC, which fully automatically identifies security vulnerabilities in mobile apps and their associated backend services. With VUSC, companies can check not only their own developments, but also purchased applications or modules for which they do not have the source code. Only the binary application, as installed on the target device, is required for the analysis. Users therefore no longer have to rely on the statements of the respective manufacturers, but can verify them efficiently.

The core of VUSC is its use of platform-specific semantic models. Instead of merely recognising simple patterns, the scanner understands the specific semantics of an Android app, an iOS app, or a Java EE web service, and assesses the code on this basis. This approach minimises false positives from the scanner. Instead of having to manually trace and check hundreds of findings in a time-consuming way, analysts can use VUSC to focus on the real problems. In addition, the scanner extracts detailed information on each vulnerability: Which server is being communicated with? Which data is being transmitted? Which algorithm is used for transport encryption?

The project represents an important step away from assessing coding practices and error patterns and towards a genuine risk assessment. With the detailed data determined by VUSC, analysts can much more easily judge which vulnerabilities are relevant and need to be prioritised for remediation.
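To make the idea of semantic analysis concrete, the following hypothetical Java sketch shows the kind of transport-security flaw such a scanner is designed to surface: a "trust-all" TrustManager that silently disables TLS certificate validation. A naive pattern matcher sees a syntactically legal API implementation; a semantic model recognises that the empty check bodies defeat validation entirely. The class and method names are invented for illustration and do not come from VUSC itself.

```java
import javax.net.ssl.X509TrustManager;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

// Hypothetical example of a flaw a semantic scanner would flag:
// a TrustManager that accepts every server certificate.
public class TrustAllExample {

    // The empty method bodies mean no certificate is ever rejected,
    // which disables TLS server authentication for any connection
    // configured with this manager.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
        @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
    };

    // Returns true if the manager accepts an arbitrary (even absent) chain.
    public static boolean acceptsAnyCertificate() {
        try {
            TRUST_ALL.checkServerTrusted(null, "RSA"); // no exception -> accepts anything
            return true;
        } catch (CertificateException | RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("accepts any certificate: " + acceptsAnyCertificate());
    }
}
```

A scanner that understands the semantics of `javax.net.ssl` can report not just the suspicious override, but also, as described above, which server the weakened connection actually talks to.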

More information about VUSC - the Code Scanner

Automatic Vulnerability Verification

Despite all their advantages, automated vulnerability scans currently create significant additional effort for development teams. The potential vulnerabilities found by a scanner have to be checked manually to weed out false positives and to assess each issue's relevance for prioritised remediation. Many reported vulnerabilities turn out to be irrelevant, e.g. if a vulnerability lies in a function that is only available to users with extensive permissions; in this case, an attacker would not gain any access through the vulnerability that they do not already possess. In this project, we develop mechanisms to automatically assess vulnerabilities identified by a code scanner. To do this, the scanner attempts to execute the affected code and trigger the vulnerability. If this succeeds, the corresponding vulnerability is prioritised for immediate remediation. Otherwise, although it cannot be ruled out that the vulnerability still exists, the demonstrably existing and relevant vulnerabilities are given priority when resources are limited. In this way, the level of security can be increased without slowing down development processes even further.
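The execute-and-trigger idea can be illustrated with a minimal sketch, assuming a reported path-traversal finding. This is not the project's actual mechanism, and the method names are invented: a verifier feeds the flagged code a crafted payload and checks whether the vulnerable effect actually occurs, confirming the finding before it is prioritised.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative verification harness: run the flagged code with a
// traversal payload and observe whether the effect is triggered.
public class VerificationSketch {

    // Flagged code: builds a file path directly from user input.
    public static Path resolveUserFile(String baseDir, String userInput) {
        return Paths.get(baseDir, userInput).normalize();
    }

    // Verifier: inject "../../etc/passwd" and test whether the resolved
    // path escapes the intended base directory. If it does, the finding
    // is confirmed; otherwise it remains an unverified candidate.
    public static boolean triggersTraversal(String baseDir) {
        Path base = Paths.get(baseDir).normalize();
        Path resolved = resolveUserFile(baseDir, "../../etc/passwd");
        return !resolved.startsWith(base);
    }

    public static void main(String[] args) {
        System.out.println("vulnerability confirmed: " + triggersTraversal("/srv/app/uploads"));
    }
}
```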

User-centric visualization of security problems

Today's software systems emerge from development processes that are not immune to errors which attackers can exploit. Automated software scanners enable developers, security analysts and data protection officers to analyze their applications and identify vulnerabilities or data breaches. These user groups, however, cannot be treated as homogeneous, since their technical knowledge differs. In addition, security reports from software scanners are often difficult to understand because of the many false positives or the sheer number of reported vulnerabilities.
The goal of this project is therefore to develop visualizations for these user groups in a user-centered approach. This makes it possible to address their exact context of use, as well as their requirements and goals. The visualizations cover, among other things, the security status of an application, an overview of software vulnerabilities, and the triage process. In this way, interactive visualizations enable users to make more informed privacy and cybersecurity decisions.

Machine Learning for Vulnerability Detection

To check software code for specific vulnerabilities, the classical approach is to extract facts such as which API functions are called with which values, which data flows exist in the analyzed program, or which settings were made in its configuration files. A rule or machine learning model then describes undesired combinations of these facts. In contrast, ATHENE researchers aim for a holistic approach that can deal flexibly with heterogeneous data extracted from different sources. To this end, deep probabilistic programming will be employed, which allows users to specify (deep) generative probabilistic models as high-level programs and then “compile” those models down into inference procedures. Probabilistic rules and deep networks are used to identify vulnerability spots in software applications, to extract sophisticated relationships between them, and to make inferences about facts involving those entities. Overall, this will lead to the first flexible platform for the rapid creation, modeling and management of training data and machine learning models for the detection of application vulnerabilities.
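The classical fact-combination approach described at the start of this paragraph can be sketched as a toy scorer. The fact names and weights below are invented for illustration, and the hand-weighted, naive-Bayes-style combination stands in for the far richer deep probabilistic models the project itself targets.

```java
import java.util.Map;

// Toy illustration of combining extracted code facts with a simple
// weighted rule into a vulnerability score. Fact names and weights
// are invented for this sketch.
public class FactScorer {

    // Log-odds weights: positive values are evidence for a vulnerability,
    // negative values are evidence against one.
    static final Map<String, Double> WEIGHTS = Map.of(
        "callsExecWithUserInput", 2.5,
        "usesParameterizedQuery", -2.0,
        "debugFlagEnabledInConfig", 1.0
    );

    // Sum the weights of the facts that hold, squash to a probability.
    public static double vulnerabilityScore(Map<String, Boolean> facts) {
        double logOdds = -1.0; // prior: most code units are benign
        for (Map.Entry<String, Boolean> f : facts.entrySet())
            if (f.getValue())
                logOdds += WEIGHTS.getOrDefault(f.getKey(), 0.0);
        return 1.0 / (1.0 + Math.exp(-logOdds)); // sigmoid
    }

    public static void main(String[] args) {
        double risky = vulnerabilityScore(Map.of("callsExecWithUserInput", true,
                                                 "debugFlagEnabledInConfig", true));
        double safe  = vulnerabilityScore(Map.of("usesParameterizedQuery", true));
        System.out.printf("risky=%.2f safe=%.2f%n", risky, safe);
    }
}
```

The point of the holistic approach above is precisely to replace such rigid, hand-specified rules with generative probabilistic models whose inference procedures are derived automatically.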

Vulnerability Detection for Hybrid Apps using Generic Analyses

Modern applications contain many security vulnerabilities. Among other things, such vulnerabilities leak data and expose code in ways that attackers can exploit for malicious purposes. Static code analysis can help alleviate such concerns by exposing vulnerabilities: static analysis tools can explore all code paths, but they are prone to false positives and lack precision, which hinders their practicality. They also cannot deal with obfuscated code, which a previous study by our team has shown to be prevalent. Moreover, current static analysis approaches target a single language. In practice, however, it is common for one application to use Java and JavaScript at the same time, with JavaScript used primarily for building user interfaces or web clients. The purpose of this project is to develop scalable and precise generic static analyses for such hybrid applications. The approach will first deobfuscate code using the existing prototype StringHound and then apply our in-house scalable data-flow analysis in combination with state-of-the-art JavaScript type analysis. The modular, extensible framework OPAL will serve as the basis.
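To give a flavour of the data-flow analyses the project builds on, here is a minimal taint-tracking sketch reduced to straight-line assignments. The source name `getIntentExtra()` and all variable names are invented; the real analyses (e.g. on top of OPAL) operate interprocedurally on full programs in multiple languages.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal taint-propagation sketch over straight-line assignments.
public class TaintSketch {

    // A statement "target = rhs", where rhs is a variable or a call.
    record Assign(String target, String rhs) { }

    // Propagate taint from a source call through assignments; report
    // whether a tainted value ends up in the named sink variable.
    public static boolean flowsToSink(List<Assign> stmts, String sourceCall, String sinkVar) {
        Set<String> tainted = new HashSet<>();
        for (Assign s : stmts) {
            if (s.rhs().equals(sourceCall) || tainted.contains(s.rhs()))
                tainted.add(s.target());
            else
                tainted.remove(s.target()); // strong update: overwritten with a clean value
        }
        return tainted.contains(sinkVar);
    }

    public static void main(String[] args) {
        List<Assign> prog = List.of(
            new Assign("a", "getIntentExtra()"), // a becomes tainted
            new Assign("b", "a"),                // taint propagates to b
            new Assign("payload", "b"));         // and reaches the sink variable
        System.out.println("leak: " + flowsToSink(prog, "getIntentExtra()", "payload"));
    }
}
```

Extending such tracking across the Java/JavaScript boundary, and doing so precisely on deobfuscated code, is exactly where the generic analyses of this project come in.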

Further development of the CogniCrypt Assistant to ensure the correct use of crypto libraries

Cryptographic APIs support developers in performing tasks that protect data, such as encryption or password management. Previous research has suggested that using such APIs correctly is hard. CogniCrypt, pioneered at TU Darmstadt, helps developers avoid such misuses with two components. The code generator provides template usages of cryptographic tasks to serve as starting points, since code found in online forums has been shown to be faulty. The static analysis component uses specifications written in a language called CrySL to point out errors in API usage. The goal of this project is to professionalize CogniCrypt by building an open-source community around it and adopting professional practices such as nightly builds and continuous integration. As part of the project, CogniCrypt will be extended beyond Java and the JCA to support other languages and libraries. Furthermore, approaches will be investigated to make the specification language CrySL address the challenges posed by evolving APIs and standards.
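A classic example of the JCA misuses that CrySL-style rules flag is the underspecified transformation string: `Cipher.getInstance("AES")` falls back to the provider default, which on standard JDKs is `AES/ECB/PKCS5Padding`, and ECB leaks plaintext structure because equal plaintext blocks encrypt to equal ciphertext blocks. The following self-contained demo (our illustration, not CogniCrypt code) makes that leak observable.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.util.Arrays;

// Demonstrates why Cipher.getInstance("AES") is flagged as a misuse:
// the ECB default encrypts identical plaintext blocks to identical
// ciphertext blocks, leaking structure.
public class EcbMisuseDemo {

    // Returns true if the first two 16-byte ciphertext blocks are equal
    // when encrypting two identical all-zero plaintext blocks.
    public static boolean equalBlocks(String transformation) {
        try {
            SecretKey key = KeyGenerator.getInstance("AES").generateKey();
            Cipher c = Cipher.getInstance(transformation);
            c.init(Cipher.ENCRYPT_MODE, key); // CBC gets a random IV here
            byte[] plaintext = new byte[32];  // two identical blocks
            byte[] ct = c.doFinal(plaintext);
            return Arrays.equals(Arrays.copyOfRange(ct, 0, 16),
                                 Arrays.copyOfRange(ct, 16, 32));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ECB leaks block equality: " + equalBlocks("AES"));                  // true
        System.out.println("CBC leaks block equality: " + equalBlocks("AES/CBC/PKCS5Padding")); // false
    }
}
```

A CrySL rule for `Cipher` rules out such defaults by constraining the transformation to an explicit, secure mode such as `AES/GCM/NoPadding`.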

Code transformers and knowledge graphs for vulnerability detection

Traditionally, vulnerability detection for code relies on static or dynamic analysis. With large volumes of code available in the wild (GitHub, Stack Overflow, etc.) to learn from, it is a natural next step to explore learning-based approaches to coding tasks. Recently, deep learning and transformers have proven successful in both code synthesis and misuse detection. Although projects like GitHub's Copilot have shown how exciting the possibilities are, recent studies have also shown that a significant portion of the generated code is vulnerable to attacks. On top of this, current transformers for coding tasks treat code as syntactically formatted text; they ignore a vast array of semantic information encoded in code, such as control flow or data dependencies. Knowledge graphs, which capture semantic relationships in textual data, have recently been adapted to include code-level semantics mined from forums like Stack Overflow and from code documentation.
As part of this project, we will build a customizable code transformer architecture in which components such as embeddings can be plugged in and swapped out to incorporate representations of code beyond syntactic trees, and we will make these transformers aware of semantic properties by integrating them with code knowledge graphs.
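To illustrate what a code knowledge graph might look like at its simplest, the sketch below stores subject–predicate–object triples and answers lookups that a transformer component could use to enrich a code token with semantic context. The entities and relations shown (`Cipher.getInstance`, `hasInsecureDefault`) are invented examples, mimicking facts mined from documentation and developer forums.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal code knowledge graph as a triple store with a simple query.
public class CodeKnowledgeGraph {

    record Triple(String subject, String predicate, String object) { }

    private final List<Triple> triples = new ArrayList<>();

    public void add(String subject, String predicate, String object) {
        triples.add(new Triple(subject, predicate, object));
    }

    // All objects related to a subject via the given predicate.
    public List<String> query(String subject, String predicate) {
        return triples.stream()
                .filter(t -> t.subject().equals(subject) && t.predicate().equals(predicate))
                .map(Triple::object)
                .toList();
    }

    public static void main(String[] args) {
        CodeKnowledgeGraph kg = new CodeKnowledgeGraph();
        kg.add("Cipher.getInstance", "hasInsecureDefault", "AES/ECB/PKCS5Padding");
        kg.add("Cipher.getInstance", "documentedIn", "JCA Reference Guide");
        System.out.println(kg.query("Cipher.getInstance", "hasInsecureDefault"));
    }
}
```

In the envisioned architecture, such graph lookups would feed the transformer's embedding components, so that a token like `getInstance` carries its semantic relations rather than only its surface text.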