Performing array aggregations in Microsoft SQL Server with XML PATH

Jeremy Dohmann
15 min readJun 2, 2021

One of the weaknesses of Microsoft SQL Server is its lack of array types, and the difficulty of implementing array-like functionality such as the ARRAY_AGG aggregate function.

This shortcoming goes much further than a simple inability to interact with niche, array-based databases. It is difficult to ask for any query that tries to relate one value to multiple other values.

Consider the tables below:

Cities:
+— — — — — -+ — — — — — — — — — — —— — +
| name | state | population
+ — — — — — -+ — — — — — — — — — — — — +
| Boston | Massachusetts | 685094
| New York | New York | 8623000
| Ithaca | New York | 31006
+ — — — — — -+ — — — — — — — — — — — — +
Neighborhoods:
+— — — —— — —— — —+ — — — —— — +
| name | city
+ — — — — — — — — + — —— — — — +
| Back Bay | Boston
| Beacon Hill | Boston
| North End | Boston
| Upper West Side | New York
| Harlem | New York
| South Hill | Ithaca
| West Hill | Ithaca
+ — — — — — — — — — + — — — — — +
Streets:
+— — — — — — — — — -+ — — — — — — — — -+
| name | neighborhood
+ — — — — — — — — — -+ — — — — — — — — -+
| Beacon St | Back Bay
| Commonwealth Ave | Back Bay
| Beacon St | Beacon Hill
| Amsterdam Ave | Upper West Side
| Broadway | Upper West Side
| Malcolm X Blvd | Harlem
+ — — — — — — — — — -+ — — — — — — — — -+

A natural question we might like to answer for this data is: for each city, what are the streets it contains?

This many-to-one relationship can be answered straightforwardly in PostgreSQL with a query like the following:

SELECT
outer_tbl.name,
COALESCE(subquery.name_list, ARRAY[]::VARCHAR[]) AS streets
FROM Cities outer_tbl
LEFT JOIN (
SELECT
Cities.name,
ARRAY_AGG(Streets.name) AS name_list
FROM Cities
JOIN Neighborhoods ON Cities.name = Neighborhoods.city
JOIN Streets ON Neighborhoods.name = Streets.neighborhood
GROUP BY Cities.name
) AS subquery
ON outer_tbl.name = subquery.name

Running the above query against the data above would return the following:

+ — — — — — -+ — — — — — — — — — — — — — — — — — — — — — -+
| name | streets
+ — — — — — -+ — — — — — — — — — — — — — — — —— — — — —— -+
| Boston | [“Commonwealth Ave”,”Beacon St”,”Beacon St”]
| New York | [”Broadway”,”Amsterdam Ave”,”Malcolm X Blvd”]
| Ithaca | []
+ — — — — — -+ — —— — — — — — — — — — — — — — — — — — —— -+

There are endless combinations of many-to-one relationships we might want to express, such as:

  • For each city, how many streets appear in two neighborhoods?
  • For each street, how many cities are there with a street bearing that name?
  • For each neighborhood, how many streets are there whose first letter is the same as the city containing that neighborhood?

Of course, any of these relations can be expressed without a GROUP BY or ARRAY_AGG, if you’re happy to receive one row per pair of entries for which that relation holds. More concretely, we could have performed the query above with no aggregation or group by and we would receive a table like the one below:

+ — — — — — -+ — — — — — — — — — — — — — — — — — — — — — -+
| name | streets
+ — — — — — -+ — — — — — — — — — — — — — — — —— — — — —— -+
| Boston | Commonwealth Ave
| Boston | Beacon St
...
| New York | Broadway
| New York | Amsterdam Ave
...
+ — — — — — -+ — —— — — — — — — — — — — — — — — — — — —— -+

This format is fine if you are interested only in querying one relation at a time from your data. However, if you ask questions about two types of many-to-one relation at once without a GROUP BY, then you’ll end up with a row for each unique triple for which both relations hold. As soon as you start querying about large numbers of relations simultaneously, you get a combinatorial explosion in rows, whereas with GROUP BY and ARRAY_AGG you end up with only one row per unique .

MSSQL’s inability to support array types, results in array aggregations being impossible to implement precisely, accessible only via hackish workarounds which only approximately perform aggregation.

In particular, most workarounds use some form of string-based aggregation. That is, some subquery converts each column values to a string then concatenates all such values together with a delimiter.

In later versions of MSSQL, this can be accomplished using STRING_AGG. However, for compatibility with <MSSQL 2016 we use XML PATH to aggregate all the values into the string representation of an XML object.

No matter which method you use, any system which emulates array aggregation by joining values into a single delimited string must represent the values in such a way that the original column values can be uniquely reconstructed from the delimited string so that they appear exactly as they did prior to the aggregation. A simple example of a scheme that doesn’t work is joining the values using , as a delimiter — any string values which originally contained a , will be interpreted as two values.

Put formally, any system that wants to robustly and reliably emulate the behavior of array aggregation must meet the following 5 criteria:

  • Delimit elements in a way such that any characters can be contained in the elements and reconstructed without interfering with the delimiting.
  • We can represent null elements in the list unambiguously.
  • We can represent empty lists unambiguously (i.e. zero-length lists).
  • We can represent empty elements (e.g. a city whose name is just the empty string) unambiguously.
  • We can represent different data types unambiguously (remember everything gets cast to a VARCHAR).

The rest of this article will discuss the workings of XML PATH subqueries, present our system for reliably performing array-like aggregation with XML PATH, and an appendix containing a proof of the correctness of our system.

XML-encoding

Appending the XML PATH directive to a subquery in MSSQL takes whatever table results from that subquery and represents it as an XML object. There are many different things one can do with XML PATH as it offers the full expressiveness of the XQuery query language, but the simplest type of aggregation is shown below. This query will simply take all the column values from the table and will string them together with commas (with no XML tags enclosing the results).

(SELECT
'delimiter' + column
FROM Table
FOR XML PATH (‘’))

There are two flavors of XML PATH aggregation. Both versions use the XML PATH directive shown above to select the desired column. Typically the subquery shown above aggregates the columns desired and then some outer query associates each aggregated string with the columns being grouped by.

The two versions differ in how the inner subquery stores the results of the XML PATH query. There is one version involving the TYPE directive, which converts the returned string into a typed XML object, and there is a non-TYPE version which leaves the returned value in string form, but converts XML reserved characters to XML-encoded reference entities. This encoding matters because the resulting string may be spit out with XML tags, so any occurrences of special XML characters in the original string values need to be properly encoded to prevent misinterpretation of the returned string (this is analogous to the issue of using commas to delimit a string which, itself, contains commas).

The version using the TYPE directive looks as follows:

(SELECTFOR XML PATH (‘’), TYPE).value(‘.’, ‘VARCHAR(max)’)

where the value function is used to convert the typed-XML object back to a string. There is one main shortcoming of the above method: it doesn’t properly handle many of the whitespace unicode code points, such as U+0006, U+001D etc. because it can’t convert them to the XML type.

The second version is otherwise the same besides the deletion of the TYPE directive and the value function

(SELECTFOR XML PATH (‘’))

version two performs the following substitutions, taken from the XML character entity reference spec, here, with the caveat that MSSQL 2017 doesn’t encode quot or apos.

&→&amp;

>→&gt;

<→&lt;

c→0xHEX;

c is a unicode character that XML PATH cannot correctly represent (these are the same characters which break the TYPE directive).

As noted it is not necessarily clear which characters will get swapped out — even though the XML spec suggests that and should get encoded, MSSQL doesn’t appear to encode them. Additionally, I have not been able to find a standard, complete list of the unicode characters which must be encoded. It may very well be that the full rewrite system is implementation specific.

One of the benefits of XML-encoding for reference entities is that it provides strong guarantees:

1. All occurrences of & are followed by lt;, gt;, amp; or 0xHEX;

2. Any occurrence span beginning at & and ending at the nearest ; represents a single character per the encoding above. In regex form we say (&.*?;) always captures a group encoding exactly one XML reference entity character in the original string.

3. There are no occurrences of < or > anywhere in the resulting string.

4. All of the mappings are unambiguously reversible as long as &amp; is reversed last. This is so because guarantees 1 and 2 above are true until ampersands are reinserted and they guarantee unambiguous reversibility as long as they hold.

5. The semi-colon is only necessary in the case where the middle, keyword element in the reference entity has variable length (e.g. for HEX numbers) or in the case where some keywords can be prefixes of others (these two conditions are related clearly). To see why this is true, if the keyword element is fixed length and not a prefix to any other keyword element, then regardless of the surrounding context we know exactly how many characters following the & to consume. The ; is clearly a convenience used to make the HEX elements easier to decode (as they are likely to be variable length and prefixes of one another).

Achieving our 5 intended functionalities for ARRAY_AGG with XML PATH

As a reminder, we would like to be able to use XML PATH to aggregate values of arbitrary data type from source columns into a single string encoding an array of values, such that we can reconstruct that array unambiguously. Doing so requires being mindful of the XML encoding mentioned above, it also requires reserving characters to denote null values and list delimiters. We also must encode the original column values so as to prevent collisions between the string versions of those values and the reserved characters we choose.

As a reminder, our 5 formal requirements are:

1. Delimit elements in a way such that the elements of the list can consist of any characters without impeding our ability to reconstruct the original columns

2. We can represent null elements in the list unambiguously

3. We can represent empty lists unambiguously (i.e. zero-length lists)

4. We can represent empty elements (e.g. a city whose name is just the empty string) unambiguously

5. We can represent different data types unambiguously (remember everything gets cast to a VARCHAR)

Our systems reserves | to delimit lists, ~ to indicate null values because they are relatively rare and human readable. We then apply an encoding system inspired by the XML-encoded rewrite system:

^ →^e

~→^n

|→^d

This system follows a logic similar to that of the XML-encoding. In particular ,we have a special escape character to mark the start of an encoding, | and we represent reserved symbols as the escape character followed by a keyword. The keywords are fixed length, so we don’t need to use a suffix like ; to get all the guarantees associated with the XML-encoding system.

The resulting subqueries, in full, look as follows:

SELECT
outer_copy.primary_key,
COALESCE(
(SELECT
‘|’ + COALESCE(
REPLACE(
REPLACE(
REPLACE(inner_table.agg_column, ‘^’, ‘^e’),
‘~’, ‘^n’),
‘|’, ‘^d’),
‘~’)
FROM Table inner_table
WHERE (inner_table.primary_key = outer_copy.primary_key)
FOR XML PATH (‘’)),
‘’) AS aggregation
FROM Table outer_copy
GROUP BY outer_copy.primary_key

Where the query above corresponds to something like:

SELECT 
Table.primary_key,
ARRAY_AGG(Table.agg_column)
FROM Table
GROUP BY Table.primary_key

Note that the outer query iterates over each primary key and selects an XML PATH-based aggregation subquery for that primary key. The inner subquery uses XML PATH to traverse every column that has the primary key we are indexing, finds each agg_column, implicitly applies the XML ref entity encoding, explicitly applies the encoding I described above, coalesces null values to ~ and then concatenates all resulting values together with |. If no values are found by the XML PATH then the whole null result is coalesced to the empty list ‘’. XML PATH implicitly casts any non-VARCHARs to a VARCHAR, allowing the end user to recast during post processing.

The resulting string-encoded array can be post processed by first splitting on |, casting ~ to null, then undoing the encodings (being careful to undo &amp; and ^e last).

Now that I’ve introduced the system above, I will show some example queries and their outputs. After that I will present an optional proof that the system above meets all 5 criteria for a good encoding.

Revisiting the tables from the intro and the query: for each city, what are the streets it contains?

SELECT
tbl_outer_0.name,
folded_subquery_1.fold_output_column AS neighborhoods
FROM Cities AS tbl_outer_0
JOIN (
SELECT
tbl_inner_1.name,
COALESCE(
(SELECT
‘|’ + COALESCE(
REPLACE(
REPLACE(
REPLACE(Streets.name, ‘^’, ‘^e’),
‘~’, ‘^n’),
‘|’, ‘^d’),
‘~’)
FROM Streets
JOIN Neighborhoods
ON Streets.neighborhood = Neighborhoods.name
WHERE tbl_inner_1.name = Neighborhoods.city
FOR XML PATH (‘’)),
‘’) AS fold_output_column
FROM Cities AS tbl_inner_1
) AS folded_subquery_1
ON tbl_outer_0.name = folded_subquery_1.name

This produce the following results:

+— — — — — -+ — — — — — — — — — — — — — — — — — — — — — +
# name # neighborhoods
+ — — — — — -+ —— — — — — — — — — — — — — — — — — — — — +
# Boston # “|Beacon St|Commonwealth Ave|Beacon St”
# New York # “|Amsterdam Ave|BroadwayMalcolm X Blvd”
# Ithaca # “”
+ — — — — — -+ — — — — — — — — — — — — — — — — — —— — — +

Where the empty list for Ithaca is realized as an empty string, and the list underlying the string representation for Boston/New York can be simply recovered by splitting on ‘|’.

Proof sections

Theorem We can encode any (possibly empty) list of (possibly empty or null) string elements [α,β,γ,…] as a single string |α|β|γ|… using the XML reference entity encodings and the rewrite system above, such that the original list of elements can be reconstructed unambiguously

Proof. Below is a grammar which is capable of parsing our encoded string into an AST describing the original content of the columns. I will show both that this grammar is an unambiguous CFG and that this grammar can be used to parse any string created by the subquery into a list of string elements corresponding to the values in the original columns.

1.enc_{escape} ∈Σ is a preterminal defined as the literal sequence ^e

2. enc_{null} ∈ Σ is a preterminal defined as the literal sequence ^n

3. enc_{|} ∈ Σ is a preterminal defined as the literal sequence ^d

4. xml_{&} ∈ Σ is a preterminal defined as the literal sequence &amp;

5. xml_{>} ∈ Σ is a preterminal defined as the literal sequence &gt;

6. xml_{<} ∈ Σ is a preterminal defined as the literal sequence &lt;

7. xml_{HEX} ∈ Σ is a preterminal defined as the literal sequence &0xHEX; where HEX is a hexidecimal code used to encode a character c in the (large, but closed and enumerable) set of unicode symbols that get converted to hexes by XML PATH, denote this set U_{HEX}

8. delim ∈ Σ is a preterminal defined as the literal |

9. null ∈ Σ is a preterminal defined as the literal ~

10. char ∈ Σ is a preterminal defined as any characters not in the set R≡{~,^,|,&,>,<,U_{HEX}}

We first lex the string, completely lexing all encoded sequences (the first 7 lexes) before lexing individual chars (8–10). This ordering is necessary because XML PATH doesn’t encode ; so it can’t be lexed to char before the xml encodings.

All possible encodings can be completely lexed unambiguously using the preterminals above such that the resulting string consists only of one of the 10 preterminal symbols above. This is true because:

  • All occurrences in the original string of symbols from R will be replaced by one of the encodings in 1–7.
  • ˆ and & occur iff the presence of at least one of the full encoded literal sequences from 1–7 follows therefore it is sufficient to include ˆand & in R.
  • The encodings are not prefixes of one another so they can occur in any order and still be lexed properly, additionally there is never ambiguity about which of 1–7 should apply when one of the opening delimiters ( ˆ and &) appears so all possible encoded sequences can be properly lexed.
  • delim and null will only ever appear wherever they were inserted to represent delimiters or null values resp. (because all of their original occurrences will be encoded as ˆn or ˆd).
  • After 1–9 are applied, only characters not from R will remain, therefore they can all be lexed as a character in char

We can think of lexing as a process which converts the input from the vocabulary of all possible strings of unicode characters to a string over the 10-symbol vocabulary defined above

We then apply the grammar below to the lexed input.

  1. L→ϵ

2. L→delim E L

3. E →ϵ

4. E → null

5. E → char E

6. E →xml_{>} E

7. E → xml_{<} E

8. E →xml_{HEX} E

9. E → xml_{&} E

10. E →enc_{escape} E

11. E →enc_{|} E

12. E→enc_{null} E

Fact 1: This grammar is unambiguous.

In general, determining the ambiguity of a CFG is undecidable, but there are certain special cases of CFGs which are known to be unambiguous such as LL(1) grammars. Proving that a grammar is an LL(1) grammar is sufficient to show that grammar is unambiguous. Briefly, an LL(1) grammar is a grammar which can be parsed left-to-right always following leftmost derivations by looking at only the next terminal symbol. A grammar is LL(1) if an LL(1) parse table can be constructed such that no cell contains more than one rule. A full explanation of what the table means or how to use it for parsing is beyond the scope of this proof. Please refer to this link for more info on LL parsing.

I used a great LL(1) parser generator provided by Andrew Duncan to convert the grammar above into the LL(1) parse table below. Where the number in each cell corresponds to the number of one of the 12 rules above. As you can see each cell has a unique rule in it, showing that the grammar is LL(1) and therefore is unambiguous.

Fact 2: Every possible concatenated series of encoded columns can be parsed by the grammar above.

All of the original elements constituting the aggregated columns, including empty elements and null, can be parsed as an E. Empty elements can be parsed using a single application of rule 3, nulls with an application of rule 4, and any other element can be parsed (due to the completeness of the lexing) using successive applications of rules 4–12 terminating in a single application of rule 3.

Fact 3: Empty lists can be parsed using rule 1.

All non empty lists of Es can be parsed as delim followed by an E, then followed by another list (possibly empty). This follows from the fact that delim only appears in places where it has been inserted between column values by XML PATH

Taken together, these three facts prove that any encoded string can be parsed unambiguously such that the individual elements are distinguishable.

Finally, we can reconstruct the original string elements from each E because the first terminals 1–7 correspond to unique symbols, terminal 9 corresponds to the unique string value None, and for any symbol matching terminal 10, the character underlying the char tag can just be emitted directly. □

Remark. We can properly reconstruct columns containing INT or DATETIME elements.

Proof. Different data types can be represented because they all implicitly get cast to VARCHAR by the XML PATH. In order to unbox them as the correct data type, we need to rely on the GraphQL Schema type inference. Fortunately, if we are selecting a column, it must be in our schema and therefore we know what data type it should be. □

--

--