Modern software systems encompass multiple programming languages and third-party libraries, and each component may be compiled into a different binary format. As a result, such a system can no longer be analyzed with a static analysis framework that supports only one format. For that reason, we are developing security analyses that work across languages to handle multi-language systems. However, developing static analyses for multiple languages from scratch requires significant effort. Therefore, the first goal of our work is to develop cross-language analyses by reusing and deeply integrating existing single-language analyses. This approach reduces the required implementation effort and ensures that cross-language analyses benefit immediately from further developments of the involved single-language analyses.
Furthermore, we address the systemic security risks of libraries by developing effective methods for identifying libraries in known binary formats and scanning them for vulnerabilities.
Despite significant advances in code scanners that automatically detect vulnerabilities in software, the patterns these scanners detect are still provided by human experts and encoded at a low technical level: they typically reference API calls and data flows at the instruction level. To remain effective, scanners must be kept up to date with new libraries and with possible variations of the same flaw.
In this project, we aim to automatically derive these vulnerability patterns from high-level formalizations of security properties such as “cryptographic keys must not be hard-coded”. Such an approach allows the human analyst to focus on the security property at hand, while the AI takes care of building the analysis patterns. We further aim to extend code scanners to support multi-platform software.
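To make the idea concrete, the following sketch shows what a derived low-level pattern might look like for the property "cryptographic keys must not be hard-coded": a textual check that flags Java string literals flowing directly into a `SecretKeySpec`. The regex and the scan function are simplified illustrations under that assumption, not the patterns our tooling actually generates.

```python
import re

# Hypothetical low-level pattern derived from the high-level property
# "cryptographic keys must not be hard-coded": flag Java code where a
# string literal flows directly into a javax.crypto.spec.SecretKeySpec.
HARDCODED_KEY = re.compile(
    r'new\s+SecretKeySpec\s*\(\s*"[^"]*"\s*\.getBytes\s*\(')

def scan(source: str) -> list[int]:
    """Return 1-based line numbers of suspected hard-coded keys."""
    return [i + 1 for i, line in enumerate(source.splitlines())
            if HARDCODED_KEY.search(line)]

bad = 'SecretKey k = new SecretKeySpec("s3cr3t".getBytes(), "AES");'
ok  = 'SecretKey k = new SecretKeySpec(loadKeyBytes(), "AES");'
```

A real analysis would track data flows rather than match text, which is exactly why deriving such patterns automatically from the high-level property is worthwhile.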
Software is ubiquitous in many areas of life, but it can contain errors or vulnerabilities. Software vulnerabilities should be fixed as early as possible during development. To support developers, automated scanners can analyze applications, but there are many reasons why these scanners are currently not used in practice. One way to improve the interface between human experts and the scanner is the visualization of data. This includes methods for presenting data visually and efficiently in order to support users in their tasks, e.g., in the triage of security vulnerabilities. By visualizing software vulnerabilities, we aim to build usable and understandable applications for software developers.
Within the framework of this project, we develop the next-generation vulnerability scanner VUSC, which fully automatically identifies security vulnerabilities in mobile apps and their associated backend services. With VUSC, companies can check not only their own developments, but also purchased applications or modules for which they do not have the source code. The analysis requires only the binary application as it is installed on the target device. Users therefore no longer have to rely on the statements of the respective manufacturers, but can verify them efficiently. The core of VUSC is the use of platform-specific semantic models. Instead of only recognising simple patterns, the scanner understands the specific semantics of an Android app, an iOS app, or a Java EE web service, and assesses the code on this basis. This approach makes it possible to minimise false positives. Instead of manually tracing and checking hundreds of messages in a time-consuming way, analysts can use VUSC to focus on the real problems. In addition, the scanner extracts detailed information on each vulnerability: Which server is being communicated with, which data is being transmitted, which algorithm is used for transport encryption? The project represents an important step in moving from coding practices and error patterns towards a risk assessment. With the detailed data determined by VUSC, the analyst can assess much more easily which vulnerabilities are relevant and must be prioritised for remediation.
Automated vulnerability scans, despite all their advantages, currently create significant additional effort for development teams. The potential vulnerabilities found by the scanner have to be checked manually to weed out false positives and to assess each issue's relevance for prioritised remediation. Many reported vulnerabilities turn out to be irrelevant, e.g., when a vulnerability lies in a function that is only available to users with extensive permissions; in this case, an attacker would gain no access through the flaw beyond what they already possess. In this project, we develop mechanisms to automatically assess vulnerabilities identified by a code scanner. To do this, the scanner attempts to execute the affected code and trigger the vulnerability. If this succeeds, the vulnerability is prioritised for immediate remediation. Otherwise, although it cannot be ruled out that the vulnerability still exists, the demonstrably present and relevant vulnerabilities are given priority when resources are limited. In this way, the level of security can be increased without further slowing down development processes.
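The resulting prioritisation can be sketched as follows; the `Finding` fields, rule names, and ordering key are illustrative assumptions, not the actual scanner's data model.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule: str
    confirmed: bool   # did dynamic execution trigger the vulnerability?
    severity: int     # static severity estimate, higher is worse

def triage(findings):
    # Dynamically confirmed vulnerabilities come first; within each
    # group, order by descending static severity.
    return sorted(findings, key=lambda f: (not f.confirmed, -f.severity))

report = triage([
    Finding("sql-injection", confirmed=False, severity=9),
    Finding("path-traversal", confirmed=True, severity=5),
])
```

Note that the confirmed but lower-severity finding outranks the unconfirmed high-severity one, reflecting the "demonstrably present first" policy described above.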
Traditionally, vulnerability detection for code relies on static or dynamic analysis. With the availability of large volumes of code in the wild (e.g., GitHub and Stack Overflow) that can serve as valuable data to learn from, it is a natural next step to explore learning-based approaches to coding tasks. Recently, deep learning and transformers have been shown to be successful in both code synthesis and misuse detection. Although projects like GitHub's Copilot have shown how exciting the possibilities are, recent studies have shown that a significant portion of the generated code is vulnerable to attacks. On top of this, current transformers for coding tasks treat code as syntactically formatted text. They ignore a vast array of semantic information encoded in code, such as control flow or data dependencies. Knowledge graphs, which capture semantic relationships in textual data, have recently been adapted to include code-level semantics mined from forums like Stack Overflow and from code documentation.
As part of this project, we will build a customizable code transformer architecture in which components such as embeddings are plug-and-play, so that representations of code beyond syntactic trees can be incorporated and made aware of semantic properties by integrating them with code knowledge graphs.
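A minimal sketch of such a pluggable design, assuming a shared embedding interface; the class names and the toy feature vectors are hypothetical stand-ins, with `GraphEmbedding` marking where a semantics-aware representation would slot in.

```python
from typing import Protocol

class CodeEmbedding(Protocol):
    """Any component mapping source code to a feature vector."""
    def embed(self, code: str) -> list[float]: ...

class TokenEmbedding:
    """Baseline: treat code as plain text (toy token statistics)."""
    def embed(self, code: str) -> list[float]:
        toks = code.split()
        return [float(len(toks)), float(sum(len(t) for t in toks))]

class GraphEmbedding:
    """Placeholder for a semantics-aware embedding, e.g. derived from
    a control-flow graph or a code knowledge graph."""
    def embed(self, code: str) -> list[float]:
        # Toy stand-in: count branching keywords as a crude CFG proxy.
        return [float(code.count("if")), float(code.count("while"))]

class Transformer:
    def __init__(self, embedding: CodeEmbedding):
        self.embedding = embedding  # swapped without touching the rest
    def encode(self, code: str) -> list[float]:
        return self.embedding.embed(code)
```

Because `Transformer` depends only on the `CodeEmbedding` protocol, a syntactic embedding can be exchanged for a knowledge-graph-aware one without changing the surrounding architecture.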
Cryptographic APIs support developers in performing tasks that protect data, such as encryption or password management. Previous research has suggested that using such APIs correctly is hard. CogniCrypt, pioneered at TU Darmstadt, helps developers avoid such misuses through two components. The code generator component provides template usages of cryptographic tasks to serve as starting points, since code provided in online forums has been shown to be faulty. The static analysis component uses specifications written in a language called Crysl to point out errors in API usage. The goal of this project is to professionalize CogniCrypt by building an open-source community around it and adopting professional practices such as nightly builds and continuous integration. As part of the project, CogniCrypt will be extended beyond Java and the JCA to support other languages and libraries. Furthermore, approaches will be investigated to make the specification language Crysl cope with the challenges posed by evolving APIs and standards.
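The kind of rule the static analysis component enforces can be sketched as a small typestate automaton: the event names follow the JCA's `Cipher` API (`getInstance`, `init`, `update`, `doFinal`), but the automaton and checker below are a simplified illustration, not actual Crysl.

```python
# Allowed next events per state; "START" is the initial state. This
# encodes the usage order getInstance -> init -> (update)* -> doFinal,
# a simplified version of the JCA Cipher protocol.
VALID_NEXT = {
    "START": {"getInstance"},
    "getInstance": {"init"},
    "init": {"update", "doFinal"},
    "update": {"update", "doFinal"},
}

def check_order(events):
    """Check an observed event trace against the typestate automaton."""
    state = "START"
    for ev in events:
        if ev not in VALID_NEXT.get(state, set()):
            return f"misuse: {ev} not allowed after {state}"
        state = ev
    return "ok"
```

Crysl specifications express such event orders declaratively (together with parameter constraints), and the analysis then checks every API object's event trace against them, much as `check_order` does here.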
To check software for specific vulnerabilities, the classical approach is to extract facts such as which API functions are called with which values, which data flows exist in the analyzed program, or which settings were made in its configuration files. A rule or machine learning model then describes undesired combinations of these facts. In contrast, ATHENE research aims for a holistic approach that can deal flexibly with heterogeneous data extracted from different sources. To this end, deep probabilistic programming will be employed, which allows users to specify (deep) generative probabilistic models as high-level programs and then “compile” those models down into inference procedures. Probabilistic rules and deep networks are used to identify vulnerable spots in software applications, to extract sophisticated relationships between them, and to make inferences about facts involving those entities. Overall, this will lead to the first flexible platform for the rapid creation, modeling and management of training data and machine learning models for the detection of application vulnerabilities.
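As a toy illustration of the generative-model view, the sketch below infers a latent "vulnerable" variable from two heterogeneous observed facts by exact enumeration; the fact names and all probabilities are made up for illustration, and a real deep probabilistic program would learn such parameters and compile the model into an inference procedure automatically.

```python
# Toy generative model: a latent boolean "vulnerable" explains two
# heterogeneous observed facts (an API-call pattern and a config
# setting). All numbers are illustrative priors, not learned values.
P_VULN = 0.1
P_FACT_GIVEN = {                 # P(fact=True | vulnerable?)
    "risky_api_call":  {True: 0.9, False: 0.2},
    "weak_tls_config": {True: 0.7, False: 0.1},
}

def posterior_vulnerable(observed):
    """Exact posterior P(vulnerable | observed facts) by enumerating
    both values of the latent variable (Bayes' rule)."""
    weights = {}
    for vuln in (True, False):
        w = P_VULN if vuln else 1 - P_VULN
        for fact, value in observed.items():
            p = P_FACT_GIVEN[fact][vuln]
            w *= p if value else 1 - p
        weights[vuln] = w
    return weights[True] / (weights[True] + weights[False])
```

Observing both risky facts raises the posterior well above the 10% prior, showing how independent pieces of evidence from different sources combine in a single probabilistic model.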
Today's software systems are created through development processes that are not immune to errors, and such errors can be exploited by attackers. Automated software scanners enable developers, security analysts and data protection officers to analyze their applications and identify vulnerabilities or data breaches. Nevertheless, these user groups cannot be treated as homogeneous, as they differ in technical knowledge. Moreover, security reports from software scanners are often difficult to understand due to many false positives or the sheer number of vulnerabilities.
The goal of this project is therefore to develop visualizations for these user groups in a user-centered approach. This makes it possible to address the exact context of use as well as the requirements and goals of each user group. Relevant views include the security status of an application, an overview of software vulnerabilities, and the triage process. In this way, interactive visualizations enable users to make more informed privacy and cybersecurity decisions.