## How to make use of Associations?

152

120

I have been curious about it for long. Now that Mathematica 10 arrives, I think it's time to ask the question: How the new Association data structure would be used to improve Mathematica programming?

There are a few related aspects of this question:

(1) How fast is the key-search in Association (on the website it is said to be extremely fast)? Would it be $O(N)$ or $O(N \log N)$ (note added: those estimate are stupid. See benchmarks below and more in answers)? Is it recommended to be used with a lot of keys? Also, how fast is the operation of inserting a key to an assoication? Would it be $O(N)$ or $O(N^2)$?

(2) Previously I use pattern matching to implement similar functionalities. For example, I do f[a]=x; f[b]=y to effectively realize f = <| a->x, b->y |>. What's the advantage of the latter over the former? (Seems keys can be more easily managed. But I don't have systematic understanding of the advantages.)

(3) The new structure Dataset is built upon Association. Is it efficient enough in terms of both memory space and computational speed, that I can use DataSet to store and calculate big data (say, table with more than 10 thousand rows, like MCMC chains)? Previously I used pure array.

(4) Examples where old code can be improved making use of Association?

I have read through the guide about Association, which is basically a collection of related functions. It would be great if the usefulness of those functions could get explained.

The replies are really great! Many thanks :)

I also did some benchmarks as below. The testing association is the same one as Leonid Shifrin's: Association[Thread[Range[n] -> Range[n] + 1]], where n is the $x$-axis of the plots. We observe (in one case. I am not sure about worst case):

• Creation, insertion (middle) and deletion (middle) are $O(N)$
• Insertion and removal at head or tail are $O(\log N)$
• Lookup is $O(1)$
• Association takes $O(N)$ memory space, with a larger constant factor than array

Note that, in the first figure, deletion is fast only when using Most or Rest. Delete[store, -1] and Delete[store, 1] are as slow as Delete[store, otherNumber]. Also, in the second figure, Association and Dataset takes almost the same memory thus not very visible.

5Interesting (+1), but very broad. – bobthechemist – 2014-07-09T20:06:03.263

7(2) if you use f[a]=x, etc., you will have precisely one copy of this data structure. This doesn't adhere to Mathematica's usual copy-on-write way of handling data. It also ties up the name f permanently. You couldn't e.g. do g=f then Clear[f] and then still have g available. It's not possible, e.g., to implement the equivalent of Append[<|a -> 1|>, b -> 2]. – Szabolcs – 2014-07-10T01:16:34.383

7I am also curios about the use of Associations. I propose that the community members who understand the answer simply give examples of such a use. My general question is: give us examples, where the Associationsenable one to achieve something that cannot be achieved (or can be, but with a higher cost) without them. – Alexei Boulbitch – 2014-07-10T07:42:39.697

One possible use case is this.

– Szabolcs – 2014-07-10T15:41:34.550

3For this specific case, using AssociationThread[Range[n], Range[n] + 1] makes creation twice as fast on my machine. – RunnyKine – 2014-07-12T15:56:35.013

@RunnyKine: Thanks! Interesting to learn this function :) – Yi Wang – 2014-07-12T16:01:21.787

164

# I. General

I will first try to briefly answer the questions, and then illustrate this with a small but practical application.

## 1.Speed of insertion / deletion

Associations are based on so called Hash Array Mapped Trie persistent data structure. One can think of this as a nested hash table, but it is more than that, because it has the following properties:

• Immutable
• Insertion and deletion are O(log N)

One may ask how does immutability even allow for such performance characteristics, and this is where it becomes non-trivial. Basically, it is this combination which is very important for our case, since Mathematica favors immutable code.

### Runtime efficiency

In practical terms, insertion and deletion are pretty fast, and you can also insert and delete key-value pairs in bulks. This latter capability is important speed booster. Here is a short example:

store = Association[{}];
Do[AppendTo[store, i -> i + 1], {i, 100000}]; // AbsoluteTiming
Length[store]

(* {0.497726, Null} *)

(* 100000 *)


Now constructing in a bulk:

store =
Association[
]; // AbsoluteTiming
Length[store]

(* {0.130438, Null} *)

(* 100000 *)


This small benchmark is important, since it shows that on one hand, insertion and deletion are fast enough that the runtime is dominated by the top-level evaluator (in the case of per-element insertion), and on the other hand, the difference is still not the order of magnitude or more - so insertion / deletion time is comparable with the top-level iteration overhead (although half - order of magnitude smaller).

### Memory efficiency

One thing to keep in mind is that Associations are rather memory-hungry. This is to be expected from the structures which can't take advantage of compact memory layout of e.g. arrays. Here is the benchmark for our example:

ByteCount[store]

(* 11167568 *)

ByteCount[Range[100000]]*2

(* 1600288 *)


The difference can be larger for more complex data structures, but it also can be smaller if many keys point to same expressions, since ByteCount does not account for memory-sharing.

## 2. Advantages over DownValues

• Immutability
• Possibility to access deeper levels inside Associations transparently by structural commands such as Map.
• Deep integration into the language - many old commands work on Associations, and there are many new functions (GroupBy, JoinAcross, etc), which make it easy to accomplish complex tasks.

### Immutability

Of these, I value immutability the most. What this means is that you can safely copy an Association, or pass it anywhere, and it becomes completely decoupled from the original one. It has no state. So, you don't have to worry about deep-copying vs shallow-copying, and all that.

This is very important since most other Mathematica's data structures are immutable (notably Lists, but also others such as general expressions, sparse arrays, etc). It takes a single mutable structure to break immutability, if it is present as a part in a larger structure. This is a really big deal, because it allows one in many cases to solve problems which otherwise would require manual resource management.

In particular, should you use symbols / their DownValues, you'd have to generate new symbols and then manage them - monitor when they are no longer needed and release them. This is a pain in the neck. With Associations, you don't have to do this - they are automatically garbage-collected by Mathematica, once no longer referenced. What's more, also other expressions inside those Associations are then garbage-collected, if those are also no longer referenced.

### Accessing deeper levels in expressions

This is very valuable, since it allows one to save a lot of coding effort and keep thinking on a higher level of abstraction. Here is an example:

stocks = <|1 -> <|"AAPL" -> {<|"company" -> "AAPL",
"date" -> {2014, 1, 2},
"open" -> 78.47009708386817|>, <|"company" -> "AAPL",
"date" -> {2014, 1, 3}, "open" -> 78.07775518503458|>},
"GE" -> {<|"company" -> "GE", "date" -> {2014, 1, 2},
"open" -> 27.393978181818177|>, <|"company" -> "GE",
"date" -> {2014, 1, 3}, "open" -> 27.05933042212518|>}|>|>


The first key here is the month, and is the only one since all prices happen to be for the January. Now, we can, for example, Map some function on various levels:

Map[f, stocks]

(*
<|1 ->
f[<|"AAPL" -> {<|"company" -> "AAPL", "date" -> {2014, 1, 2},
"open" -> 78.4701|>, <|"company" -> "AAPL",
"date" -> {2014, 1, 3}, "open" -> 78.0778|>},
"GE" -> {<|"company" -> "GE", "date" -> {2014, 1, 2},
"open" -> 27.394|>, <|"company" -> "GE",
"date" -> {2014, 1, 3}, "open" -> 27.0593|>}|>]|>
*)

Map[f, stocks, {2}]

(*
<|1 -> <|"AAPL" ->
f[{<|"company" -> "AAPL", "date" -> {2014, 1, 2},
"open" -> 78.4701|>, <|"company" -> "AAPL",
"date" -> {2014, 1, 3}, "open" -> 78.0778|>}],
"GE" -> f[{<|"company" -> "GE", "date" -> {2014, 1, 2},
"open" -> 27.394|>, <|"company" -> "GE",
"date" -> {2014, 1, 3}, "open" -> 27.0593|>}]|>|>
*)

Map[f, stocks, {3}]

(*
<|1 -> <|"AAPL" -> {f[<|"company" -> "AAPL",
"date" -> {2014, 1, 2}, "open" -> 78.4701|>],
f[<|"company" -> "AAPL", "date" -> {2014, 1, 3},
"open" -> 78.0778|>]},
"GE" -> {f[<|"company" -> "GE", "date" -> {2014, 1, 2},
"open" -> 27.394|>],
f[<|"company" -> "GE", "date" -> {2014, 1, 3},
"open" -> 27.0593|>]}|>|>
*)


This last example makes it easy to see how would we, for example, round all prices to integers:

Map[MapAt[Round, #, {Key["open"]}] &, #, {3}] & @ stocks

(*
<|1 -> <|"AAPL" -> {<|"company" -> "AAPL",
"date" -> {2014, 1, 2}, "open" -> 78|>, <|"company" -> "AAPL",
"date" -> {2014, 1, 3}, "open" -> 78|>},
"GE" -> {<|"company" -> "GE", "date" -> {2014, 1, 2},
"open" -> 27|>, <|"company" -> "GE", "date" -> {2014, 1, 3},
"open" -> 27|>}|>|>
*)


As noted by Taliesin Beynon in comments, there are more elegant ways to do this, using the new operator forms for Map and MapAt:

Map[MapAt[Round, "open"], #, {3}] & @ stocks


and

MapAt[Round, {All, All, All, "open"}] @ stocks


which illustrate my point of transparent access to deeper layers even more.

So, what does this buy us? A lot. We do use here immutability heavily, because it is only due to immutability that functions such as Map can operate on Associations efficiently, producing new ones, completely decoupled from the old ones. In fact, as long as manipulations are structural ones on "higher levels", this is very efficient, because the actual expressions (leaves at the bottom) might be untouched.

But there is more here. With just one command, we can transparently inject stuff on any level in an Association. This is very powerful capability. Just think of what would be involved in doing so with the traditional hash tables (nested DownValues, for instance). You will have to generate several symbols (often many of them), then manually traverse the nested structure (non-transparent, much more code, and slower), and also do manual resource management for these symbols.

### Integration with the language

I may expand later on this, but many examples have been given in other answers. Basically, lots of functions (Part, Map, MapAt, MapIndexed, Delete, etc) do work on Associations, including nested ones. Besides, you can use multi-arg Part on nested associations having inside them other expressions (Lists,Associations, etc). In addition to this, a host of new functions have been introduced to work specifically on Associations, making it easy to do complex data transformations (GroupBy, Merge, JoinAcross, Catenate, KeyMap, etc). The language support for Associations is approaching that for List-s. Together, they make the core of the data-processing primitives.

So, the addition of Associations to the language made it strictly more powerful, primarily because of two things: level of language integration and immutability.

## 3. Dataset and large data.

Right now, Dataset is not suitable for working with really large data (the one that does not fit into memory). Since Associations are rather memory-hungry, it puts additional constraints on the sizes of data sets amenable to the current version of Dataset.

However, work is underway to address this problem. Currently, the best way to view Dataset is IMO as a query language specification with an in-memory implementation. In the future, it can also have other / different implementations / backends.

Also, in practice, a lot of interesting data sets are still small enough that can be effectively worked with using Dataset. This is particularly true for various "business-type" data, which tend to often not be very huge. The huge data often involves large numerical data sets, and I am sure this case will be addressed by the Dataset framework in the near future.

See section III.

# II. Associations as objects / structs

Associations can be used as structs. To illustrate some of the possibilities, I will use a simple object which has to store person's first and last name, and have get and set methods for them, and also have an additional method to return full name. I will consider three different ways to implement this, two of which will use Associations

## 1. Mutable struct implementation (one of the possibilities)

Here is the code:

ClearAll[makePersonInfoMutable];
makePersonInfoMutable[fname_, lname_] :=
Module[{fn = fname, ln = lname, instance},
SetAttributes[instance, HoldAll];
instance @ getFirstName[] := fn;
instance @ setFirstName[name_] := fn = name;
instance @ getLastName[] := ln;
instance @ setLastName[name_] := ln = name;
instance @ getFullName[] := fn <> " " <> ln;
instance];


Here is how one can use this:

pinfo = makePersonInfoMutable["Leonid", "Shifrin"]

(* instance$552472 *) pinfo @ getFirstName[] (* "Leonid" *) pinfo @ setLastName["Brezhnev"] pinfo @ getFullName[] (* "Leonid Brezhnev" *)  This method is Ok, but it has some short-comings: one needs to introduce several internal mutable variables, which must be manually managed. Also, the instance variable itself must be managed. ## 2. Using Associations - the immutable way One can instead use Associations very simply, as follows: pinfoIm = <|"firstName" -> "Leonid", "lastName" -> "Shifrin"|> (* <|"firstName" -> "Leonid", "lastName" -> "Shifrin"|> *) pinfoIm["firstName"] (* "Leonid" *) AppendTo[pinfoIm, "lastName" -> "Brezhnev"] (* <|"firstName" -> "Leonid", "lastName" -> "Brezhnev"|> *)  This is fine and efficient, and no additional symbol /state management is needed here. However, this method also has its short-comings: • No natural way to define methods on such objects (have to be just functions, but then they will produce new objects) • What if I do want the changes made to the object to be reflected in other places where the object is used. In other words, what if I don't want to create an immutable copy, but want instead to share some state? So, this method is fine as long as the problem can be completely addressed by immutable objects (no state). ## 3. Combining Associations and mutability One can do this using the following method (of my own invention, so can't guarantee it will always work): pinfoSM = Module[{self}, self = <| "firstName" -> "Leonid", "lastName" -> "Shifrin", "setField" -> Function[{field, value}, self[field] = value; self = self], "fullName" -> Function[self@"firstName" <> " " <> self@"lastName"], "delete" -> Function[Remove[self]] |> ];  what happens here is that we capture the Module variable, and use it inside the Association. In this way, we inject some mutability into otherwise immutable structure. You can see that now, we can define "methods" - functions which work on this particular instance, and possibly mutate its state. Here is an example of use: pinfoSM["firstName"] (* "Leonid" *) pinfoSM["setField"]["lastName", "Brezhnev"]; pinfoSM["fullName"][] (* "Leonid Brezhnev" *)  Note that here we used an extra pair of brackets to perform the function call. If you don't like this syntax, you can instead of the line "fullName" -> Function[self@"firstName" <> " " <> self@"lastName"]  use "fullName" :> self@"firstName" <> " " <> self@"lastName"  and then call just pinfoSM["fullName"] (this is possible because Associations respect RuleDelayed for the key-value pairs, and don't then evaluate the r.h.s. (value) until it is extracted). In this way, the fields can be made to behave similar to Python's properties. EDIT As noted by saturasl in comments, the above version exhibits erroneous behavior when changed properties are accessed directly. In the last example, for instance, after the change we still get pinfoSM["lastName"] (* "Shifrin" *)  The reason is that while self has changed, the pinfoSM still stores the same field values for lastName and firstName. One possible solution here is in the spirit of Python's properties: hide the actual fields, and introduce the accessors with the names which we previously used for the fields themselves: pinfoSM = Module[{self}, self = <| "_firstName" -> "Leonid", "_lastName" -> "Shifrin", "setField" -> Function[{field, value}, self["_" <> field] = value; self = self], "fullName" -> Function[self@"firstName" <> " " <> self@"lastName"], "delete" -> Function[Remove[self]], "firstName" :> self@"_firstName", "lastName" :> self@"_lastName" |> ];  Now the previous code will all work, and we also have after change: pinfoSM["lastName"] (* "Brezhnev" *)  As it should be. It is understood that the fields "_firstName" and "_lastName" are private and should not be accessed directly, but rather via the "accessor" fields "firstName" and "lastName". This provides a level of indirection needed to account for the changes in self correctly. END EDIT So, this version is stateful. Still, depending on the problem, it may have advantages. One is for cases where you want all instances of the object to update if you make a change in one (in other words, you don't want an independent immutable copy). Another is that the "methods" here work specifically on a given instance. You do need to manage these objects (destroy them once they are no longer referenced), but here you only have one symbol which is stafeful. I find this construct to be a nice combination of mutable and immutable state. # III. Example: a toy hierarchical database Here, I will illustrate the utility of both Associations and a new operator form of functional programming in Mathematica, by constructing a toy hierarchical database of stock data. ## Sample data We start with the data: data = Composition[ Map[Association], Flatten[#, 1] &, Map[ Function[ company, Composition[ Map[ Composition[ Prepend["company" -> company], MapThread[Rule, {{"date", "open"}, #}] & ] ], If[MatchQ[#, _Missing], {}, #] &, FinancialData[#, "Open", {{2013, 12, 25}, {2014, 1, 05}}] & ] @ company ]]]@{"GOOG", "AAPL", "MSFT", "GE"}  Here is the result: (* {<|"company" -> "AAPL", "date" -> {2013, 12, 26}, "open" -> 80.2231|>, <|"company" -> "AAPL", "date" -> {2013, 12, 27}, "open" -> 79.6268|>, <|"company" -> "AAPL", "date" -> {2013, 12, 30}, "open" -> 78.7252|>, <|"company" -> "AAPL", "date" -> {2013, 12, 31}, "open" -> 78.2626|>, <|"company" -> "AAPL", "date" -> {2014, 1, 2}, "open" -> 78.4701|>, <|"company" -> "AAPL", "date" -> {2014, 1, 3}, "open" -> 78.0778|>, <|"company" -> "MSFT", "date" -> {2013, 12, 26}, "open" -> 36.6635|>, <|"company" -> "MSFT", "date" -> {2013, 12, 27}, "open" -> 37.0358|>, <|"company" -> "MSFT", "date" -> {2013, 12, 30}, "open" -> 36.681|>, <|"company" -> "MSFT", "date" -> {2013, 12, 31}, "open" -> 36.8601|>, <|"company" -> "MSFT", "date" -> {2014, 1, 2}, "open" -> 36.8173|>, <|"company" -> "MSFT", "date" -> {2014, 1, 3}, "open" -> 36.6658|>, <|"company" -> "GE", "date" -> {2013, 12, 26}, "open" -> 27.2125|>, <|"company" -> "GE", "date" -> {2013, 12, 27}, "open" -> 27.3698|>, <|"company" -> "GE", "date" -> {2013, 12, 30}, "open" -> 27.3708|>, <|"company" -> "GE", "date" -> {2013, 12, 31}, "open" -> 27.4322|>, <|"company" -> "GE", "date" -> {2014, 1, 2}, "open" -> 27.394|>, <|"company" -> "GE", "date" -> {2014, 1, 3}, "open" -> 27.0593|> } *)  Note that the code to construct this result heavily uses the operator forms for various functions (here Map and Prepend), and also Composition is frequently used. This has many advantages, including clarity and maintainability (but there are others too). ## Generating the transform to nested data store The following functions will generate a transform, that would transform the above data into a nested data store, built out of Lists and Associations ClearAll[keyWrap]; keyWrap[key_Integer] := key; keyWrap[key_] := Key[key]; ClearAll[pushUp]; (* Operator form *) pushUp[key_String] := pushUp[{key}]; pushUp[{keyPath__}] := With[{keys = Sequence @@ Map[keyWrap, {keyPath}]}, GroupBy[Part[#, keys] &] ]; (* Actual form *) pushUp[assocs : {__Association}, keys__] := pushUp[keys][assocs]; (* Constructs a transform to construct nested dataset from flat table *) ClearAll[pushUpNested]; pushUpNested[{}] := Identity; pushUpNested[specs : {_List ..}] := Composition[ Map[pushUpNested[Rest[specs]]], pushUp@First[specs] ];  The pushUp function is basically GroupBy, wrapped in a different syntax (which makes it easier to specify multi-part paths). I have simplified from the one I have used for my purposes - the original version was also deleting the key on which we group, from the grouped associations. In our case, we need to supply the specification to get the nested data set. Here is an example, where we group by the year first, then by the month, and then by the company name: transform = pushUpNested[{{"date", 1}, {"date", 2}, {"company"}}] (* Map[Map[Map[Identity]@*GroupBy[#1[[Sequence[Key["company"]]]] &]]@* GroupBy[#1[[Sequence[Key["date"], 2]]] &]]@* GroupBy[#1[[Sequence[Key["date"], 1]]] &] *)  Note that this operator approach has a number of advantages. It is declarative, and at the end we generate a complex transformation function which can be analyzed and argued about. Now, here is how it can be used:  nested = transform @ data (* <|2013 -> <|12 -> <|"AAPL" -> {<|"company" -> "AAPL", "date" -> {2013, 12, 26}, "open" -> 80.2231|>, <|"company" -> "AAPL", "date" -> {2013, 12, 27}, "open" -> 79.6268|>, <|"company" -> "AAPL", "date" -> {2013, 12, 30}, "open" -> 78.7252|>, <|"company" -> "AAPL", "date" -> {2013, 12, 31}, "open" -> 78.2626|>}, "MSFT" -> {<|"company" -> "MSFT", "date" -> {2013, 12, 26}, "open" -> 36.6635|>, <|"company" -> "MSFT", "date" -> {2013, 12, 27}, "open" -> 37.0358|>, <|"company" -> "MSFT", "date" -> {2013, 12, 30}, "open" -> 36.681|>, <|"company" -> "MSFT", "date" -> {2013, 12, 31}, "open" -> 36.8601|>}, "GE" -> {<|"company" -> "GE", "date" -> {2013, 12, 26}, "open" -> 27.2125|>, <|"company" -> "GE", "date" -> {2013, 12, 27}, "open" -> 27.3698|>, <|"company" -> "GE", "date" -> {2013, 12, 30}, "open" -> 27.3708|>, <|"company" -> "GE", "date" -> {2013, 12, 31}, "open" -> 27.4322|>}|>|>, 2014 -> <|1 -> <|"AAPL" -> {<|"company" -> "AAPL", "date" -> {2014, 1, 2}, "open" -> 78.4701|>, <|"company" -> "AAPL", "date" -> {2014, 1, 3}, "open" -> 78.0778|>}, "MSFT" -> {<|"company" -> "MSFT", "date" -> {2014, 1, 2}, "open" -> 36.8173|>, <|"company" -> "MSFT", "date" -> {2014, 1, 3}, "open" -> 36.6658|>}, "GE" -> {<|"company" -> "GE", "date" -> {2014, 1, 2}, "open" -> 27.394|>, <|"company" -> "GE", "date" -> {2014, 1, 3}, "open" -> 27.0593|>}|>|>|> *)  You can see the immediate advantage of this - it is very easy to construct any other nested structure we want, with different grouping at different levels. ## Querying the nested structure Along the same lines, here is how we can construct queries to run against this structure. For simplicity, I will only consider queries which specify explicitly the keys we want to keep at each level, as a list, or All if we want to keep all entries at that level. Here is the query generator: (* Modified Part, to stop at missing elements *) ClearAll[part]; part[m_Missing, spec__] := m; part[expr_, spec__] := Part[expr, spec]; (* Builds a query to run on nested dataset *) ClearAll[query]; query[{}] := Identity; query[spec : {(_List | All) ..}] := Composition[ Map[query[Rest[spec]]], With[{curr = First@spec}, If[curr === All, # &, part[#, Key /@ curr] & ] ] ];  It also heavily uses the operator form, constructing a rather complex function to query the nested data set, from a simple spec. Let us now try constructing some queries: q = query[{{2013}, All, {"AAPL", "MSFT"}}] (* Map[Map[Map[ Identity]@*(part[#1, Key /@ {"AAPL", "MSFT"}] &)]@*(#1 &)]@*(part[#1, Key /@ {2013}] &) *)  Now, we can run it: q @ nested (* <|2013 -> <|12 -> <|"AAPL" -> {<|"company" -> "AAPL", "date" -> {2013, 12, 26}, "open" -> 80.2231|>, <|"company" -> "AAPL", "date" -> {2013, 12, 27}, "open" -> 79.6268|>, <|"company" -> "AAPL", "date" -> {2013, 12, 30}, "open" -> 78.7252|>, <|"company" -> "AAPL", "date" -> {2013, 12, 31}, "open" -> 78.2626|>}, "MSFT" -> {<|"company" -> "MSFT", "date" -> {2013, 12, 26}, "open" -> 36.6635|>, <|"company" -> "MSFT", "date" -> {2013, 12, 27}, "open" -> 37.0358|>, <|"company" -> "MSFT", "date" -> {2013, 12, 30}, "open" -> 36.681|>, <|"company" -> "MSFT", "date" -> {2013, 12, 31}, "open" -> 36.8601|>}|>|>|> *)  Now, let's look back and see what we've done: in just a few lines of code, we have constructed a fully functional small hierarchical database (actually, a generator of such databases), based on nested Association-s, and then a query generator which allows one to construct and run simple queries against that database. Now, this has been a toy dataset. Is this construction practical for larger set of data (like, tens of thousands of records and more)? Yes! I have originally written this type of code for a problem involving data sets with hundreds of thousands of records, and the queries run extremely fast, as long as most of the data is categorical in nature (can be reduced to a finite small set of distinct keys on each level). Now, think about what would be involved in implementation of this type of thing without Associations. My bet is that this wouldn't even be possible, or at the very least would've been much more work. And because usual hash tables are not immutable, the whole elegant operator approach of construction of queries / transforms as function composition in a declarative way wouldn't even come to mind (at least as far as I am concerned). ### A note on Dataset Incidentally, Dataset uses a more complex version of the same set of ideas. I can now also partly answer the questions many people asked about what does the Dataset bring which isn't readily available without it. The answer is that, by generating queries in a way conceptually similar to the above query function (although, of course, in a much more general way), it brings a new level of automation to the query construction, particularly for nested hierarchical data sets. In a way, query is a toy example of a compiler from the simple declarative query specification to an actual query that can be run. The more complex is the query, the more this layer will buy you. # Conclusions So, at least for me, the presence of Associations in the language (as well as the operator form of many functions) not just simplifies many tasks, but actually opens new ways of thinking and programming. 44Someone upvoted after a few seconds of this being posted. Either this is a new speed record in reading, or some people have too much trust in me :) – Leonid Shifrin – 2014-07-10T19:14:40.187 2That's right. I only read your name on the first pass before voting :) Now is the second pass and I will definitely read again the technical points and the wonderful example. Thanks a lot! – Yi Wang – 2014-07-10T19:38:48.960 1This is definitely a long and awesome read. Some great ideas that I will be returning back to. Big +1 and thanks for sharing. – RunnyKine – 2014-07-10T19:47:15.640 5Great post, Leonid, thanks! I want to elaborate on operator forms, which I have a particular soft spot for. This expression doesn't contain any operator forms: Map[MapAt[Round, #, {Key["open"]}] &, #, {3}] & @ stocks. The level prevents us using operator forms for Map, but we can an operator form for the MapAt: Map[MapAt[Round, "open"], #, {3}] @ stocks. And in fact we can use an (undocumented) MapAt feature to do it all in one swoop: MapAt[Round, {All, All, All, "open"}] @ stocks – Taliesin Beynon – 2014-07-10T20:12:21.677 @TaliesinBeynon Thanks, Tali, well spotted! I will edit your versions in, and correct my comment. Great to see you here, finally :) – Leonid Shifrin – 2014-07-10T20:13:43.257 Associations is a fantastic addition. Yet to download 10 but thanks for this outline Leonid. – Mike Honeychurch – 2014-07-10T22:48:43.320 @MikeHoneychurch Totally agree, Associations are very powerful. I'll add some more stuff to this post soon. – Leonid Shifrin – 2014-07-10T22:51:43.767 1@LeonidShifrin: A naive question: considering Association has much quicker insertion than List, do you think one can also use it when we need linked list? (previously I learnt from you that we can use nested List to implement linked list) – Yi Wang – 2014-07-11T12:44:54.010 1@YiWang Insertion time is fast when you don't care where to insert, or insert at the beginning or at the end. I did not benchmark what happens when you insert somewhere in the middle. Besides, in linked lists you usually have some control over the "current position", and I don't see how to do this with Associations. But this has to be investigated further. – Leonid Shifrin – 2014-07-11T14:06:48.743 @LeonidShifrin: You are absolutely right. Insertion/deletion in the middle is very slow (I updated the question with a benchmark). So not all linked list operations can be done fast. However, if we want to do something like recursion over a list, we may not need to insert/delete in the middle. – Yi Wang – 2014-07-12T14:03:01.653 1I have been a devoted Mathematica user since version 1.2, but I never picked up much CS education. As a result you always loose me when you talk about immutable, deep copy, shallow copy, bindings, semantics, etc. – Ted Ersek – 2014-07-13T11:51:31.810 1@TedErsek I don't have any formal CS training either (I wish I had though, that would've been very useful for me now). My background has been Theoretical Physics, I switched to programming relatively recently. But the things you mentioned are not as bad or hard as they sound. Anyone with the familiarity of a few languages, like say C, Java and Mathematica, would be able to learn about these things relatively quickly. – Leonid Shifrin – 2014-07-13T11:56:48.047 @TedErsek Perhaps I should later add some simple examples to this post, to illustrate those things I mention (immutability, etc). The problem is that the post is already pretty long. – Leonid Shifrin – 2014-07-13T14:30:52.223 2@LeonidShifrin If you would like to include a few smaller examples that will be super nice. I am less familiar with immutability compared with QCD sum rules (I am a cosmologist though). – Yi Wang – 2014-07-17T06:59:29.190 @YiWang All right, I will add some examples. It will have to wait a little, though, perhaps a day or two. – Leonid Shifrin – 2014-07-17T11:52:24.163 1Is there any advantage to doing AppendTo[pinfoIm, "lastName" -> "Brezhnev"] vs. pinfoIm["lastName"] = "Brezhnev"? thanks. – Nasser – 2014-07-20T17:57:34.543 1@Nasser No, there isn't. In fact, your suggestion might be better, since it will preserve the order of the key. – Leonid Shifrin – 2014-07-21T09:25:20.377 2 I found it quite fun to use Association as a circular linked list for cyclic repetitions. Sort of like how an InternalBag can hold itself, I think this is quite powerful and might have more practical uses beyond what I used it for. Do you see any other good applications for it? – rm -rf – 2014-08-08T17:16:43.187 1@rm-rf A very nice and inventive application, of course you've got my vote. It is somewhat similar in spirit to the trick I am using above in my third version of the object implementation, but in your case it serves a somewhat different purpose. You probably still have to manage the Module-generated symbol, but the complexity reduction from managing complex structure to just a single symbol is quite significant. I can't make immediate examples, but this way of thinking is surely worth exploring much more. One may try to use Associations for novel ways to construct general data structures. – Leonid Shifrin – 2014-08-08T18:48:16.410 @rm-rf In general, I can tell that the addition of Associations made the language much more powerful, as far as I am concerned, particularly together with operator forms (for which we should thank a number of folks, Tali and Daniel in particular). I can already see that my code patterns have transformed and evolved significantly, to the extent that removing Assocations from the language now would make me feel helpless. One clear sign of importance of a given feature is that it not just simplifies some tasks, but change the ways you think and use the language, which is the case for me here. – Leonid Shifrin – 2014-08-08T18:53:38.560 Thanks for the insight, Leonid. I agree that the operator forms are really useful. I've mostly been using Associations as the equivalent of a dict in python, but it is certainly capable of a lot more and I've been playing with it, trying to find out other uses. Re: "You probably still have to manage the Module-generated symbol" — do you mean that this symbol won't be garbage collected? – rm -rf – 2014-08-09T18:52:47.747 Re: like dictionaries - to me, some of the more valuable parts of assocs are immutability and transparency on several levels of depth. The former allows to do things not easily possible otherwise (real fp on assocs), and the latter often saves a lot of boilerplate code and reduces complexity. Re: not garbage-collected - I'd think so. I did not do extensive tests, but since you return it from Module as a part of another expression, I'd think you will need to take care of its destruction yourself. If I am wrong, all the better! I will check this soon. – Leonid Shifrin – 2014-08-09T20:02:57.797 1this is incredibly helpful! You should write a book. Wait.. – gpap – 2014-10-09T22:20:52.200 4This should be part of the documentation / tutorials of Mathematica! (+1) – Silvia – 2015-05-31T16:44:05.490 @Silvia Thanks :) Particularly glad to hear this from such a veteran user as you. – Leonid Shifrin – 2015-05-31T19:56:59.013 1@Leonid Shifrin: There is an unnoticed bug in your code "II.3. Combining Associations and mutability". The variable self$___ updates itself, however, the variable pinfoSM remains un-updated, such that pinfoSM["lastName"] still gives Shifrin after we update self$___. A manual pinfoSM = self$___ will fix this. I wonder if there is an automatic approach? – saturasl – 2015-12-21T04:41:22.833

@Leonid Shifrin: A working fix is to replace "lastName" -> "Shifrin" with "n" -> 0, "lastName" :> If[self["n"] == 0, "Shifrin", self["lastName"]], and add ++self["n"] to the definitions for "setField" and "fullName". But I think there should be more elegant and general solution than this. – saturasl – 2015-12-21T05:13:00.800

@saturasl Thanks, nicely spotted! I have added a fix which seems most natural to me, see if you like it. – Leonid Shifrin – 2015-12-21T08:01:18.647

@Leonid Shifrin: I like your method, a natural approach to make use of the private and public variables. Thanks. – saturasl – 2015-12-21T16:35:14.243

@saturasl Glad that you liked it. Thanks again for pointing this out. I have seen such effects previously, but did not notice it here when originally posting. – Leonid Shifrin – 2015-12-21T16:47:58.943

@LeonidShifrin - I tried both forms of pinfoSM. As observed and discussed, the two versions yield different results for the command pinfoSM["lastName"] and equal results for the other tested commands. However, for each version I got the same results for all test commands with and without the command self=self. Does self=self change anything? Is it needed? – Soldalma – 2016-10-23T16:33:06.717

@Soldalma Probably not. I don't quite remember what was the original reason for it. Originally, I think I added self=self to mostly update the way result looks when stored in a variable. Looking at it now, I don't see any technical reason why self=self would be really necessary. – Leonid Shifrin – 2016-10-23T17:23:40.017

49

I helped design Association, and I designed and implemented Dataset, so I wanted to comment on question 3:

Dataset is designed explicitly for hierarchical data. It supports any 'shape' of data, inferring the shape when the Dataset is first created. It also tracks the shape of the data as transformations are applied to the dataset, using a type-inference scheme. When the data has certain forms it formats "nicely", an example would be a list of associations whose keys are all the same -- these get formatted as a table.

In reply to question 4, I wrote a little spiel on Hacker News that is sort-of relevant: https://news.ycombinator.com/item?id=8012451

4Your example mentioned on Hacker News does not quite work on a fresh Mathematica 10.0.0 session on Windows (it does on Linux): Executing titanic = ExampleData[{"Dataset", "Titanic"}];titanic[Histogram, "age"] produces Failure[ ...]. Only titanic = ExampleData[{"Dataset", "Titanic"}]; titanic[#, "age"] &@Histogram or any other previous mentioning (autoloading) of Histogram works. Sometimes I wonder why WRI releases beta versions as .0.0 releases. – Rolf Mertig – 2014-07-22T12:53:10.103

24

First of all, it's not a new data structure, it's a new only in a Wolfram Mathematica.

• About complexity of a data structure.

It's a Wolfram implementation of a hash-map. With a complexity of an operations (worst case in parenthesis):

Space   O(n)
Search  O(1) (O(n))
Insert  O(1) (O(n))
Delete  O(1) (O(n))


This can be easily checked: create custom Association and use different functions on it with AbsoluteTiming:

assoc = Table[
Association[
Table[i -> i^2, {i, n}]], {n, {1, 2^5, 2^10, 2^15, 2^20}}];
ListPlot[Table[
AbsoluteTiming[
Lookup[assoc[[n]],
RandomInteger[<|1 -> 1, 2 -> 2^5, 3 -> 2^10, 4 -> 2^15,
5 -> 2^20|>[n]]]][[1]], {n, 5}], PlotTheme -> "Web"]


Note 1: The map from one element is a most inefficient for searching.

Note 2: I've taken 1, 32, 1024, 32768 and 1048576 items respectively for 1..5. And try to find RandomInteger there.

Same code for AssociateTo and KeyDropFrom leads to O(1) and O(1) respectively.

Would it be O(N) or O(N log N)? Would it be O(N) or O(N^2)?

Instead this is O(1).

Pattern matching and Association is a completely different things, but, in some cases can be interchanged.
You can use complex test in pattern matching, like (define only for Integer):
isPrime[x_?IntegerQ] := PrimeQ[x];

And other possible patterns and matches. This kinda more powerful than Association`, isn't it?