Benefit of using Dataset to collect SQLSelect output

Considering a DatabaseExample included with Mathematica:

Needs["DatabaseLink`"]
conn = OpenSQLConnection["publisher"];
SQLSelect[conn, "ROYSCHED", "ShowColumnHeadings" -> True]//Short

{{TITLE_ID,LORANGE,HIRANGE,ROYALTY},{BS1011,0,5000,0.1},<<49>>,{PY2008,0,50000,0.1}}

Now with the introduction of v10 and functions such as Association and Dataset, we can convert this data to a Dataset:

(* Convert a header row plus data rows into a Dataset of Associations,
   provided every data row has the same length as the header *)
sqltodataset[fields_, data_] /; Equal @@ Join[{Length@fields}, Length /@ data] :=
 Dataset[Association[Thread[fields -> #]] & /@ data]
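A slightly more direct variant (a sketch, assuming v10's AssociationThread and the same header-plus-rows structure returned by SQLSelect with "ShowColumnHeadings") would be:

```mathematica
(* Same conversion, but AssociationThread builds each row's
   Association in one step instead of Association[Thread[...]] *)
sqltodataset[fields_, data_] /; Equal @@ Join[{Length@fields}, Length /@ data] :=
 Dataset[AssociationThread[fields -> #] & /@ data]
```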


data = SQLSelect[conn, "ROYSCHED", "ShowColumnHeadings" -> True];
ds = sqltodataset[First@data, Rest@data];

Querying should now be much faster:

Do[SQLSelect[conn, "ROYSCHED", {"ROYALTY"}, SQLColumn["TITLE_ID"] == "BS1001"], {100}] // Timing
Do[Normal[Select[ds, #"TITLE_ID" == "BS1001" &][[All, "ROYALTY"]]], {100}] // Timing

And it seems to be, although not quite as fast as I expected:

{0.171601,Null}

{0.093601, Null}

But at the expense of RAM:

ByteCount[ds]

35644

Is there a real upside to converting to a Dataset, or should I just stick to SQL to query? And is there a more direct way to convert straight to a Dataset?

kale

Posted 2014-07-29T15:07:45.713

Reputation: 10,290

Just a note: you can use AssociationThread in place of Association[Thread[...]]. – RunnyKine – 2014-07-29T15:12:54.147

@RunnyKine Thanks... Soooo many new functions to learn. – kale – 2014-07-29T15:14:57.887

I know, makes using v10 so much fun :) – RunnyKine – 2014-07-29T15:15:33.053

1The benefit would be more pronounced with a remote database: pulling a large block of data and then manipulating it locally will be faster than numerous more specific queries. You can of course do that without a dataset; I think the reason to use a dataset in some cases would be to preserve the fields/keys so you can work in a more expressive syntax similar to what is available in a database. – mfvonh – 2014-07-29T17:14:20.847

1It has been discussed in chat several times, around the date of the release, that Dataset currently is optimized to work with hierarchical data, rather than tabular. There are plans to add SQL-like backend(s) to Dataset, which will enable efficient manipulation of tabular data, but those will likely not be based on Associations in the same manner as Dataset is (namely, one Association per row in a db). – Leonid Shifrin – 2014-07-29T17:26:21.497

@kale I'm a little surprised at that result; I would expect, for any dataset of a reasonable size, the database query to win out over the Dataset "query". For small queries there is overhead in the initial IO to establish a database connection - did you try running in the other order, or explicitly disconnecting between queries? Note also that databases may have sophisticated caching mechanisms that recognised that the Dataset retrieval was a superset of the first query. – Gordon Coale – 2014-12-22T08:56:19.250

Answers

Depending on your requirements... I've gone radical. My practice is:

- If I generate the data, I use associations and datasets and save them as .mx files or CloudObjects.
- If the data is in SQL, I use the simplest SQL possible to load it, then transform it to associations or datasets and save to an .mx file or CloudObjects.
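The second case could be sketched roughly like this (the file name "roysched.mx" is just an illustrative choice; it assumes a connection and the header-plus-rows result from the question):

```mathematica
(* Pull once with plain SQL, convert to a Dataset, cache locally as .mx *)
data = SQLSelect[conn, "ROYSCHED", "ShowColumnHeadings" -> True];
ds = Dataset[AssociationThread[First@data, #] & /@ Rest@data];
Export["roysched.mx", ds];

(* Later sessions can skip the database entirely *)
ds = Import["roysched.mx"];
```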

I'm a SQL free guy now and happier for it.

This allows much more flexibility when wrangling data and joining disparate data sources for computation.

Mitch

Posted 2014-07-29T15:07:45.713

Reputation: 146

I tend to do this too, but it does require that your data, or some reasonable fraction of it, can be held in memory. Often that is not the case, at which point some SQL connection will still be required (although this can just be used to load chunks of data as needed). – b3m2a1 – 2018-04-03T07:34:41.953