Extending a Static Code Analyzer by PostgreSQL-Specific Rules

Type: Bachelor's thesis
Supervisor: Jan Nidzwetzki

The popular relational DBMS PostgreSQL is implemented in C, a programming language in which errors are easy to introduce. The PostgreSQL codebase is large (roughly 1 million lines of code) and is maintained by a large community. Because a database management system is the backbone of many modern systems, code quality is essential. Patches are reviewed by the community, and the software is tested mainly with regression tests. These tests find bugs by comparing the computed result with an expected result and by registering crashes during the test run. Unfortunately, this kind of dynamic testing covers only the code paths that are actually executed during the tests.

In contrast, static code analysis detects programming mistakes without executing the software; all possible code paths are analyzed and checked for problematic constructs. PostgreSQL introduces several custom data structures (hashes, lists, bitmap sets, paths, …). To detect errors in the use of these data structures, a static code analyzer needs rules that describe the problematic constructs.

For example, the function bms_add_member adds a value to a bitmap set. Its signature is:

 Bitmapset *bms_add_member(Bitmapset *a, int x);

The function takes a bitmap set and the value to be added. In most cases, the passed bitmap set is modified in place; in some cases, however, a new bitmap set has to be created. The function returns a bitmap set: if the passed set could be modified, the same reference is returned; if a new set had to be created, the reference to the new set is returned. Therefore, the following code is correct:

 bms = bms_add_member(bms, 5);

The following code also works in most cases, because the passed bitmap set is modified in place:

 bms_add_member(bms, 5); 

However, if a new bitmap set is created inside bms_add_member(), the passed reference might be freed and then points to invalid memory. Using this reference can lead to undefined behavior or crashes. The example illustrates how easy it is to use this function incorrectly: the compiler does not detect the error, and the code works in most cases.
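
The hazard can be reproduced outside PostgreSQL with a small stand-in. The following sketch is not PostgreSQL's Bitmapset implementation; ToySet, toyset_add_member, toyset_is_member, and toyset_demo are hypothetical names chosen for illustration. It assumes only that growing the set allocates a new block and frees the old one, which is the situation described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a growable bitmap set (hypothetical, not PostgreSQL code). */
typedef struct ToySet {
    int      nwords;   /* number of 32-bit words in words[] */
    uint32_t words[];  /* flexible array member holding the bits */
} ToySet;

/* Adds x to the set and returns the set to use from now on.
 * If the set has to grow, a new block is allocated and the old one is freed,
 * so the pointer passed in must not be used afterwards. */
ToySet *toyset_add_member(ToySet *a, int x)
{
    assert(x >= 0);
    int wordnum = x / 32;
    int bitnum = x % 32;

    if (a == NULL || wordnum >= a->nwords) {
        int newn = wordnum + 1;
        ToySet *b = calloc(1, sizeof(ToySet) + (size_t) newn * sizeof(uint32_t));
        if (b == NULL)
            abort();                       /* keep the sketch simple */
        b->nwords = newn;
        if (a != NULL)
            memcpy(b->words, a->words, (size_t) a->nwords * sizeof(uint32_t));
        free(a);                           /* the caller's old pointer now dangles */
        a = b;
    }
    a->words[wordnum] |= (uint32_t) 1 << bitnum;
    return a;
}

/* Returns 1 if x is a member of the set, 0 otherwise. */
int toyset_is_member(const ToySet *a, int x)
{
    if (a == NULL || x < 0 || x / 32 >= a->nwords)
        return 0;
    return (a->words[x / 32] >> (x % 32)) & 1u;
}

/* Correct usage: the result is always assigned back to the same variable. */
int toyset_demo(void)
{
    ToySet *s = NULL;
    s = toyset_add_member(s, 5);    /* allocates the first block */
    s = toyset_add_member(s, 100);  /* grows: the previous block is freed */
    /* Writing "toyset_add_member(s, 100);" without the assignment would
     * leave s pointing at freed memory after the growth step. */
    int ok = toyset_is_member(s, 5) && toyset_is_member(s, 100);
    free(s);
    return ok;
}
```

In this sketch the second call in toyset_demo is the interesting one: it is exactly the case where dropping the assignment would compile cleanly and fail only at run time.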

To detect this problem, a static code analyzer needs to know that the function's result must not be ignored and must always be assigned to the same variable that is passed as the first argument.
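
The first half of such a rule, "the result must not be ignored", can already be expressed with a compiler attribute. The sketch below uses the warn_unused_result attribute supported by GCC and Clang; TOY_NODISCARD, toy_append, and toy_nodiscard_demo are hypothetical names, and the function is a toy with the same calling convention, not a PostgreSQL API. The second half of the rule, assigning the result back to the variable used as the first argument, is not covered by this attribute and would presumably need a custom check in the analyzer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical macro: expands to the attribute only on GCC-compatible compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define TOY_NODISCARD __attribute__((warn_unused_result))
#else
#define TOY_NODISCARD
#endif

/* Appends value to arr (which holds *len items) and returns the array to use
 * from now on; realloc may move the block, so the old pointer must be dropped. */
TOY_NODISCARD
int *toy_append(int *arr, size_t *len, int value)
{
    assert(len != NULL);
    int *grown = realloc(arr, (*len + 1) * sizeof(int));
    if (grown == NULL)
        abort();                     /* keep the sketch simple */
    grown[*len] = value;
    *len += 1;
    return grown;
}

int toy_nodiscard_demo(void)
{
    size_t len = 0;
    int *arr = NULL;

    arr = toy_append(arr, &len, 5);  /* accepted: the result is used */
    /* toy_append(arr, &len, 6); */  /* uncommenting this line makes GCC and
                                        Clang report an unused-result warning */
    int ok = (len == 1 && arr[0] == 5);
    free(arr);
    return ok;
}
```

The attribute turns the silent mistake into a compile-time diagnostic, but it would still accept "other = toy_append(arr, &len, 6);", which is why a project-specific rule remains useful.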

The goal of this thesis is to identify rules that help avoid common programming mistakes in PostgreSQL. These rules should be implemented in a static code analyzer.

Research questions:

  • What tools exist to implement these static code checks? What are the advantages and disadvantages of these tools? What are the limitations?
  • Can the implemented rules detect programming mistakes in the current versions of PostgreSQL or PostgreSQL extensions?

This thesis requires skills in C development and software testing, basic knowledge of compiler construction, and familiarity with database internals.


Baudouin Schwederski | 10.06.2024