Random sample from sql database uniformly over time

2

2

I'd like to sample rows from a table uniformly over time, here is the naive way to do this:

SELECT * FROM table ORDER BY RAND() LIMIT 10000

Unfortunately, executing this query causes the table to first get sorted, which can take very very very long if it is large, and the query freezes. There are other ways for dealing with random sampling in sql, but they are really complicated. Is there any shortcut to doing this with a DatabaseLink` command?

Code to get started

Here's how to I set up my connection:

<< DatabaseLink`
DatabaseExplorer[] (* wizard to create the connection "database_name"*)
conn = OpenSQLConnection["database_name"];

For the query DatabaseLink uses the function SQLSelect, which has these options:

enter image description here

Update:

Here are the table structure details:

  • Generic mysql db with single table (381 Gb)
  • Uses InnoDB storage engine
  • All columns have btree indices
  • The table has a primary key and 589 million rows
  • The column I'm random sampling from is 'created_at' of type datetime
  • The table's schema look like this:

    CREATE TABLE photos ( id int(10) unsigned NOT NULL AUTO_INCREMENT, tweet_id varchar(255) DEFAULT NULL, created_at datetime NOT NULL, hashtags varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, text varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL, source varchar(255) DEFAULT NULL, image_hash text, PRIMARY KEY (id), UNIQUE KEY photos_tweet_id (tweet_id), KEY photos_date (object_date) ) ENGINE=InnoDB AUTO_INCREMENT=589069903 DEFAULT CHARSET=latin1

M.R.

Posted 2016-05-07T23:58:37.557

Reputation: 30 727

Question was closed 2016-05-11T15:58:35.593

Wouldn't this question be more about SQL than Mathematica though? – MarcoB – 2016-05-08T00:04:49.527

Lol, I'd tend to agree, but there are no good SQL solutions. Maybe Mathematica can circumvent this? – M.R. – 2016-05-08T00:06:12.620

1How is table structured? Does it have a primary key? If so what are the column(s) of the primary key and their data type? What database product are you using? – Edmund – 2016-05-08T13:07:30.873

@Edmund I'll add those details now. – M.R. – 2016-05-08T22:46:53.117

Extending what "Edmund" said... If the primary key is just an auto incriminated integer. Then select the max primary key then pick uniform (or whatever) set of keys then select those rows. This as the issue that you may try to select rows that no longer exist but you could be clever and check for existence first. If the row is gone roll the dice again. – c186282 – 2016-05-08T23:51:24.447

2One does not necessarily have to do a simple random sample. Assuming that there are no cyclic patterns in the database, you might consider selecting a random record in the first k records and then sample every k-th record. (So it's systematic after the random start.) There are lots of alternatives with desirable sampling properties that don't require knowing the complete population ahead of time. Details on those methods would be better addressed at Cross Validated. – JimB – 2016-05-09T01:11:37.893

No answers