Dataflow analysis

Being able to track the current state of a program as well as meta information without actually executing it; static analysis.

Control-flow graph (CFG)

A graph where nodes are blocks and there is an edge between two blocks if there is a control-flow instruction (jump, conditional branch, return) from one block to the other.

Blocks are sequences of instructions that has a single distinct label, a possibly empty sequence of instructions, ending with a single control-flow instruction

Very useful for dataflow analysis.
Helpful to imagine edges leading from one instruction to the next within blocks as well for analysis.

Live variable analysis

A variable $v$ is considered live at a program point if $v$ is defined before the point and might be used by some future part of the execution path of the program.

Key word “might,” we would rather mistake a variable as live when it isn’t.
Variable use could depend on user input or other parts of the program that we don’t know the value of until we actually run it.

To perform liveness analysis is to calculate which variables are live between each statement.

To be useful, it should be more precise than scoping rules.

Consider the following program. Liveness information is computed between each statement.

int f(int x) {
  /* x is now live */
  int a = x + 2;
  /* x is live, a is now live */
  int b = a * a;
  /* x is live, b is now live, a is not live anymore */
  int c = b + x;
  /* c is live, b and x are not live anymore */
  return c;
}

The scopes of $x$ , $a$ , $b$ , and $c$ are all the same, but $a$ , $b$ , and $c$ are never live at the same time. This is a good example of the difference between liveness and scope. Furthermore, this means they could theoretically be allocated to the same register.

Liveness is associated with CFG edges

Edges connect blocks and sequential instructions, and are quite literally in between statements. We associate liveness information with each edge.

Useful for later optimizations where the same register can be used for different temporaries in the same instruction, therefore rendering it useless. See move-related edges.

Uses and definitions

Every instruction uses some variables and may also define new variables. We can keep track of the set of used variables and set of new definitions for each statement.

Create a map from a node/statement $s$ to the set of uses and definitions, respectively.
For example, $use [s]$ and $def [s]$ . If $s$ is a = b + c, then $use [s] = {b, c}$ and $def [s] = {a}$ .

Formal liveness definition

A variable $v$ is considered live on an edge $e$ in the CFG if

there exists some statement $s$ in the CFG such that $v \in use [s]$ .
there exists a directed path $p : e ⇝ s$ such that for every internal node $n \in p$ , $v \in / def [n]$ . In other words, every statement leading up to $s$ cannot redefine $v$ .

Computing liveness with dataflow analysis

Define two more sets for each statement $s$ :

$in [s]$ is the set of variables that are live on entry to $s$ .
$out [s]$ is the set of variables that are live on exit from $s$ .

Consider two sequential statements $s_{1} \to s_{2}$ . Note that we cannot claim $in [s_{2}] = out [s_{1}]$ . This is because in the CFG, there may be multiple statements that lead to $s_{2}$ . That is, there could also exist a $s_{1}^{'}$ such that $s_{1}^{'} \to s_{2}$ . So we have to consider $out [s_{1}^{'}]$ as well when computing $in [s_{2}]$ .

Other constraints that must be true for a statement $s$ :

$use [s] \subseteq in [s]$ . Set of variables that $s$ uses must be a subset of those that are live on entry to $s$ . $s$ doesn’t have to use all variables that are live, they might be used later in the program path.
$out [s] - def [s] \subseteq in [s]$ . If a variable is live on exist from $s$ , and $s$ does not redefine it, then it must have been live on entry to $s$ .

If a variable was redefined, then we remove it from the set of live variables on exit of a statement. We treat it as a completely new variable (and in LLVM, SSA means that this is true by default!).

If $s_{1} \to s_{2}$ , then $in [s_{2}] \subseteq out [s_{1}]$ . As explained above, it’s a subset relationship because multiple statements could feed into $s_{2}$ and provide liveness information.

Naive algorithm

For each variable defined in the program, find where it’s used in the program. Then, trace the path backwards in the CFG, marking the variable as live on edges until we reach the statement where it’s defined, or until we reach a previously-explored statement.

Extremely inefficient because we process variables one at a time, and may retrace the same path for different variables.

Iterative node algorithm

Instead, we can iteratively build up $in [s]$ and $out [s]$ for each statement, updating them as follows:

$in [s] \leftarrow use [s] \cup (out [s] - def [s])$
$out [s] \leftarrow ⋃_{s^{'} \in succ [s]} in [s^{'}]$

The algorithm stops executing when $in [s]$ and $out [s]$ stop changing for all statements $s$ in the CFG. This is guaranteed to occur because these two sets monotonically increase in size and there are only a finite number of statements in a program. Size never decreases, so we’ll reach the end eventually.

For each statement $s$ , initialize $in [s] = \emptyset$ and $out [s] = \emptyset$ . This is because for the best analysis, the liveness sets should be as small as possible, so starting out from empty makes sense.

for all statements s, in[s] = {} and out[s] = {}

repeat until no change in `in` and `out`:
  for all statements s:
    out[s] <- union(in[s']) where s' in succ[s]
    in[s] <- use[s] union (out[s] - def[s])
  end
end

Improving runtime by using a queue

Notice that $out [s]$ only changes if the successors of $s$ have changed. Therefore, we don’t need to process a statement $s^{'}$ if its successor $s$ has not changed.

To do this, we can instead use a queue instead of a while loop and for loop. For each statement, we check if $in [s]$ has changed after performing the union operation, and readd nodes that lead to $s$ back to the queue.

for all statements s, in[s] = {} and out[s] = {}
initialize a queue Q with all statements in the CFG

while Q is not empty:
  s = Q.pop()
  old_in = in[s]

  out[s] <- union(in[s']) where s' in succ[s]
  in[s] <- use[s] union (out[s] - def[s]) 

  if (old_in != in[s]):
    for all previous statements s in pred[s], execute Q.push(s)
  end
end

The program naturally stops when the queue empties out, implying that $in$ and $out$ aren’t changing anymore.

Vault

Explorer

Dataflow analysis

Control-flow graph (CFG)

Live variable analysis

Liveness is associated with CFG edges

Uses and definitions

Formal liveness definition

Computing liveness with dataflow analysis

Naive algorithm

Iterative node algorithm

Improving runtime by using a queue

Graph View

Table of Contents

Backlinks

Go to main siteVault

Explorer

Dataflow analysis

Control-flow graph (CFG)

Live variable analysis

Liveness is associated with CFG edges

Uses and definitions

Formal liveness definition

Computing liveness with dataflow analysis

Naive algorithm

Iterative node algorithm

Improving runtime by using a queue

Graph View

Table of Contents

Backlinks

Vault