Write Moby Dick, approximately

270

59

Here is a 1.2Mb ASCII text file containing the text of Herman Melville's Moby-Dick; or, The Whale. Your task is to write a program or function (or class, etc. -- see below) which will be given this file one character at a time, and at each step must guess the next character.

This is . Your score will be

2*L + E

where L is the size of your submission in bytes, and E is the number of characters it guesses incorrectly. The lowest score wins.


There are two versions of the file, and you may use either of them in an answer:

  • whale2.txt (linked above) - in this version the text is not wrapped, so newlines appear only at the end of paragraphs. This file is recommended for new answers.

  • whale.txt - in this version the text is wrapped to a width of 74 characters, so you have to predict the end of each line as well as predicting the text. This makes the challenge more fiddly and is included for mostly historical reasons.

Both files are the same size, 1215236 bytes.


Further particulars

Your submission will be a program or function (etc.) that will be called or invoked or sent data multiple times. (1215235 times to be exact.) When it is called for the nth time it will be given the nth character of whale.txt or whale2.txt and it must output its guess for the (n+1)th character. The E component of its score will be the total number of characters that it guesses incorrectly.

Most submissions will need to store some state in between invocations, so that they can track how many times they have been called and what the previous inputs were. You can do this by writing to an external file, by using static or global variables, by submitting a class rather than a function, using a state monad, or whatever else works for your language. Your submission must include any code required to initialise its state before the first invocation.

Your program should run deterministically, so that it always makes the same guesses given the same input (and hence always gets the same score).

Your answer must include not only your submission, but also the code you used to calculate the E part of its score. This need not be written in the same language as your submission, and will not be counted towards its byte count. You are encouraged to make it readable.

Regarding the interface between your submission and this score-calculating program, anything is fine, as long as your program always gives one byte of output before receiving its next byte of input. (So, for example, you can't just pass it a string containing all of the input and get a string back containing all of the output.)

You must actually run your test program and calculate/verify your score before submitting your entry. If your submission runs too slowly for you to verify its score then it is not qualified to compete, even if you know what its score would be in principle.

The L component of your score will be calculated according to the usual rules for code golf challenges. If your submission will contain multiple files, please take note of the rules on scoring and directory structure in that case. Any data that your code uses must be included in your L score.

You may import existing libraries but may not load any other external files, and your code may not access the whale.txt or whale2.txt file in any way other than described above. You may not load any pre-trained neural networks or other sources of statistical data. (It's fine to use neural networks, but you have to include the weight data in your submission and count it towards your byte count.) If for some reason your language or libraries include a feature that provides some or all of the text of Moby Dick, you may not use that feature. Aside from that you can use any other built-in or library features that you like, including ones relating to text processing, prediction or compression, as long as they're part of your language or its standard libraries. For more exotic, specialised routines that include sources of statistical data, you would have to implement them yourself and include them in your byte count.

It is likely that some submissions will include components that are themselves generated by code. If this is the case, please include in your answer the code that was used to produce them, and explain how it works. (As long as this code is not needed to run your submission it will not be included in your byte count.)

To summarise, all answers should include the following things:

  • Your submission itself. (The code, plus any data files it uses - these can be links if they're large.)
  • An explanation of how your code works. Please explain the I/O method as well as how it predicts the next character.
  • The code you used to evaluate your score. (If this is identical to a previous answer you can just link to it.)
  • Any code you used to generate your submission, along with an explanation of that code. This includes code that you used to optimise parameters, generate data files etc. (This doesn't count towards your byte count but should be included in your answer.)

Leaderboard

var QUESTION_ID=152856,OVERRIDE_USER=21034;function answersUrl(e){return"https://api.stackexchange.com/2.2/questions/"+QUESTION_ID+"/answers?page="+e+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+ANSWER_FILTER}function commentUrl(e,s){return"https://api.stackexchange.com/2.2/answers/"+s.join(";")+"/comments?page="+e+"&pagesize=100&order=desc&sort=creation&site=codegolf&filter="+COMMENT_FILTER}function getAnswers(){jQuery.ajax({url:answersUrl(answer_page++),method:"get",dataType:"jsonp",crossDomain:!0,success:function(e){answers.push.apply(answers,e.items),answers_hash=[],answer_ids=[],e.items.forEach(function(e){e.comments=[];var s=+e.share_link.match(/\d+/);answer_ids.push(s),answers_hash[s]=e}),e.has_more||(more_answers=!1),comment_page=1,getComments()}})}function getComments(){jQuery.ajax({url:commentUrl(comment_page++,answer_ids),method:"get",dataType:"jsonp",crossDomain:!0,success:function(e){e.items.forEach(function(e){e.owner.user_id===OVERRIDE_USER&&answers_hash[e.post_id].comments.push(e)}),e.has_more?getComments():more_answers?getAnswers():process()}})}function getAuthorName(e){return e.owner.display_name}function process(){var e=[];answers.forEach(function(s){var r=s.body;s.comments.forEach(function(e){OVERRIDE_REG.test(e.body)&&(r="<h1>"+e.body.replace(OVERRIDE_REG,"")+"</h1>")});var a=r.match(SCORE_REG);a&&e.push({user:getAuthorName(s),size:+a[2],language:a[1],link:s.share_link})}),e.sort(function(e,s){var r=e.size,a=s.size;return r-a});var s={},r=1,a=null,n=1;e.forEach(function(e){e.size!=a&&(n=r),a=e.size,++r;var t=jQuery("#answer-template").html();t=t.replace("{{PLACE}}",n+".").replace("{{NAME}}",e.user).replace("{{LANGUAGE}}",e.language).replace("{{SIZE}}",e.size).replace("{{LINK}}",e.link),t=jQuery(t),jQuery("#answers").append(t);var o=e.language;/<a/.test(o)&&(o=jQuery(o).text()),s[o]=s[o]||{lang:e.language,user:e.user,size:e.size,link:e.link}});var t=[];for(var o in s)s.hasOwnProperty(o)&&t.push(s[o]);t.sort(function(e,s){return e.lang>s.lang?1:e.lang<s.lang?-1:0});for(var c=0;c<t.length;++c){var i=jQuery("#language-template").html(),o=t[c];i=i.replace("{{LANGUAGE}}",o.lang).replace("{{NAME}}",o.user).replace("{{SIZE}}",o.size).replace("{{LINK}}",o.link),i=jQuery(i),jQuery("#languages").append(i)}}var ANSWER_FILTER="!t)IWYnsLAZle2tQ3KqrVveCRJfxcRLe",COMMENT_FILTER="!)Q2B_A2kjfAiU78X(md6BoYk",answers=[],answers_hash,answer_ids,answer_page=1,more_answers=!0,comment_page;getAnswers();var SCORE_REG=/<h\d>\s*([^\n,]*[^\s,]),.*?(\d+)(?=[^\n\d<>]*(?:<(?:s>[^\n<>]*<\/s>|[^\n<>]+>)[^\n\d<>]*)*<\/h\d>)/,OVERRIDE_REG=/^Override\s*header:\s*/i;
body{text-align:left!important}#answer-list,#language-list{padding:10px;width:380px;float:left}table thead{font-weight:700}table td{padding:5px}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script> <link rel="stylesheet" type="text/css" href="//cdn.sstatic.net/codegolf/all.css?v=83c949450c8b"> <div id="answer-list"> <h2>Leaderboard</h2> <table class="answer-list"> <thead> <tr><td></td><td>Author</td><td>Language</td><td>Score</td></tr></thead> <tbody id="answers"> </tbody> </table> </div><div id="language-list"> <h2>Winners by Language</h2> <table class="language-list"> <thead> <tr><td>Language</td><td>User</td><td>Score</td></tr></thead> <tbody id="languages"> </tbody> </table> </div><table style="display: none"> <tbody id="answer-template"> <tr><td>{{PLACE}}</td><td>{{NAME}}</td><td>{{LANGUAGE}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr></tbody> </table> <table style="display: none"> <tbody id="language-template"> <tr><td>{{LANGUAGE}}</td><td>{{NAME}}</td><td>{{SIZE}}</td><td><a href="{{LINK}}">Link</a></td></tr></tbody> </table>

Bounties

From time to time I'll offer bounties to encourage different approaches.

The first one will be 50 points, to be awarded to the best-scoring answer posted during the bounty period.

The second will be 100 points which I will award to the first answer that takes a different approach from the existing ones, according to my subjective judgement. (I'll create the bounty at the time of awarding it.) I'm particularly keen to see an approach based on neural networks, but I'll also award it to other techniques that are sufficiently different from the existing Markov model based approaches. Answers don't need to be among the top scoring to compete for this bounty, but ideally they should beat the median score. It is possible for one answer to win both bounties.

Nathaniel

Posted 2018-01-09T09:25:31.847

Reputation: 3 293

Comments are not for extended discussion; this conversation has been moved to chat.

– Dennis – 2018-01-30T02:29:47.787

1https://xkcd.com/1960/ appears to be a reference to this challenge! – A. Rex – 2018-02-26T13:49:55.810

Answers

119

///, 2*1 + 1020874 = 1020876

 

Prints a space.

daniero

Posted 2018-01-09T09:25:31.847

Reputation: 15 255

Comments are not for extended discussion; this conversation has been moved to chat.

– Dennis – 2018-01-30T02:26:04.857

90

Node.js, 2*224 + 524279 = 524727

Please refer to the change log at the end of this post for score updates.

A function taking and returning a byte.

a=[...l='14210100'],m={},s={},b={}
f=c=>a.some((t,n)=>x=s[y=l.slice(n)]>t|/^[A-Z '"(]/.test(y)&&b[y],l+=String.fromCharCode(c),a.map((_,n)=>(m[x=l.slice(n)]=-~m[x])<s[y=l.slice(n,8)]||(s[y]=m[x],b[y]=c)),l=l.slice(1))&&x||32

It consists of a simple PPM model that looks at the last 8 characters to predict the next one.

We trust a pattern of length L when we have encountered it at least T[L] times, where T is an array of arbitrary thresholds: [1,1,2,1,2,3,5,2]. Furthermore, we always trust a pattern whose first character matches [A-Z '"(].

We select the longest trusted pattern and return the prediction with the highest score associated to this pattern at the time of the call.

Notes

  • This is obviously not optimized for speed, but it runs in about 15 seconds on my laptop.

  • If we were allowed to repeat the process several times in a row without resetting the model, the number of errors would converge to ~268000 after 5 iterations.

  • The current success rate of the prediction function is ~56.8%. As noticed by @immibis in the comments, if bad and correct guesses are mixed together, the result is not even barely readable.

    For instance, this snippet near the end of the book:

    Here be it said, that this pertinacious pursuit of one particular whale,[LF]
    continued through day into night, and through night into day, is a thing[LF]
    by no means unprecedented in the South sea fishery.
    

    becomes:

    "e e be it said, that thes woacangtyous sarsuet of tie oort cular thale[LF][LF]
     orsinued toeough tir on e togh   and sheough toght an o ters af t shin[LF][LF]
    be to means insrocedented tn hhe sputh Sevsaonh ry,
    

    By replacing bad guesses with underscores, we have a better idea of what the function got right:

    _e_e be it said, that th_s _____n___ous __rsu_t of __e __rt_cular _hale_[LF]
    _o__inued t__ough ___ _n__ __gh__ and _h_ough __ght _n_o ____ __ _ _hin_[LF]
    b_ _o means _n_r_cedented _n _he __uth _e_____h_ry_
    

    NB: The above example was created with a previous version of the code, working on the first version of the input file.

Test code

/**
  The prediction function f() and its variables.
*/
a=[...l='14210100'],m={},s={},b={}
f=c=>a.some((t,n)=>x=s[y=l.slice(n)]>t|/^[A-Z '"(]/.test(y)&&b[y],l+=String.fromCharCode(c),a.map((_,n)=>(m[x=l.slice(n)]=-~m[x])<s[y=l.slice(n,8)]||(s[y]=m[x],b[y]=c)),l=l.slice(1))&&x||32

/**
  A closure containing the test code and computing E.
  It takes f as input.
  (f can't see any of the variables defined in this scope.)
*/
;
(f => {
  const fs = require('fs');

  let data = fs.readFileSync('whale2.txt'),
      len = data.length,
      err = 0;

  console.time('ElapsedTime');

  data.forEach((c, i) => {
    i % 100000 || console.log((i * 100 / len).toFixed(1) + '%');

    if(i < len - 1 && f(c) != data[i + 1]) {
      err++;
    }
  })

  console.log('E = ' + err);
  console.timeEnd('ElapsedTime');
})(f)

Change log

  • 524727 - saved 19644 points by switching to whale2.txt (challenge update)
  • 544371 - saved 327 points by forcing patterns starting with a capital letter, a quote, a double-quote or an opening parenthesis to be also always trusted
  • 544698 - saved 2119 points by forcing patterns starting with a space to be always trusted
  • 546817 - saved 47 points by adjusting the thresholds and golfing the prediction function
  • 546864 - saved 1496 points by extending the maximum pattern length to 8 characters
  • 548360 - saved 6239 points by introducing the notion of trusted patterns, with thresholds depending on their length
  • 554599 - saved 1030 points by improving the line feed prediction
  • 555629 - saved 22 points by golfing the prediction function
  • 555651 - saved 40 points by golfing the prediction function
  • 555691 - initial score

Arnauld

Posted 2018-01-09T09:25:31.847

Reputation: 46 756

40For the curious, no, this does not produce anything like Moby Dick. It's a whole lot of sidg tlanses,oeth to, shuld hottut tild aoersors Ch, th! Sa, yr! Sheu arinning whales aut ihe e sl he traaty of rrsf tg homn Bho dla tiasot a shab sor ty, af etoors tnd hocket sh bts ait mtubb tiddin tis aeewnrs, dnhost maundy cnd sner aiwt d boelh cheugh -aaieiyns aasiyns taaeiins! th, tla. It does manage to get a few complete words sometimes. Like whales. – immibis – 2018-01-10T04:52:55.240

21@immibis The title of the challenge was chosen wisely. This is Moby Dick, approximately. :-) – Arnauld – 2018-01-10T06:57:33.510

It's good to include your old score, crossed out using &lt;strike&gt; tags, so we can see that you've improved it. – Nathaniel – 2018-01-10T09:13:04.150

3@Nathaniel There's been many updates, so that would be barely readable and not really informative. I've added a change log instead with short explanations about the improvements. – Arnauld – 2018-01-10T09:45:41.663

@Arnauld thanks, that's helpful – Nathaniel – 2018-01-10T09:50:39.307

41I think your program is actually doing a perfect translation into Gaelic. – Beska – 2018-01-10T20:20:07.447

The comment of the produced output includes a comma followed by a character other than a space. Would the score improve if you assumed that most punctuation (apart from the apostrophe) will always be followed by a space? – Draco18s – 2018-01-10T22:58:31.940

1@Draco18s It's hard to tell whether this comma was a good or a bad guess. If it was a bad guess, the prediction function may have legitimately tried to put a letter after whatever-other-letter-that-was-actually-there-instead-of-the-comma once it received it. – Arnauld – 2018-01-10T23:57:48.613

@Arnauld That's a good point. I was just curious what the result might be. – Draco18s – 2018-01-11T00:08:25.823

@Draco18s I've just tried it (for the comma only) and it resulted in a loss of 92 points. Commas are also quite often followed by line feeds, quotes or double-quotes, so it's probably better to just let the model decide. – Arnauld – 2018-01-11T00:16:56.470

Ah, in which case I'd say "some form of white space." But oh well. (Followed by quotes...oh, at the end of someone speaking. Ok, white space and more punctuation! :D) – Draco18s – 2018-01-11T00:20:36.813

@immibis - considering the output, maybe save this code for the inevitable 'call of cthulu' repeat. – Scott – 2018-01-11T01:19:19.280

This answer Is currently the most upvoted but not the best score. Still a cool answer of course. – Zach Boyd – 2018-01-11T04:04:07.010

I've updated the question with a version of the file where the text is not wrapped - you can probably save some points by using that – Nathaniel – 2018-01-11T09:36:13.060

@Nathaniel Thanks for the heads up. Updated. – Arnauld – 2018-01-11T11:50:50.400

@immibis Stop this! You're awaking the old ones! – allo – 2018-01-11T13:30:13.830

Nitpicking, but shouldn't there be an underscore at the "By replacing bad guesses with underscores, we have a better idea of what the function got right:" part for the second [LF]s? – Kevin Cruijssen – 2018-01-24T13:46:12.627

1@KevinCruijssen The second [LF] is a correct guess. The first one is a bad guess ([LF] instead of g and comma, respectively). – Arnauld – 2018-01-24T14:42:42.547

1Nice solution! Could you save a few bytes (4?) by storing aliasing 'slice'? – Craig Ayre – 2018-01-30T14:49:37.420

@Beska, if you mean Irish then no, no it's not! :p – Shaggy – 2018-02-07T11:00:06.937

75

Perl, 2·70525 + 326509 = 467559

Predictor

$m=($u=1<<32)-1;open B,B;@e=unpack"C*",join"",<B>;$e=2903392593;sub u{int($_[0]+($_[1]-$_[0])*pop)}sub o{$m&(pop()<<8)+pop}sub g{($h,%m,@b,$s,$E)=@_;if($d eq$h){($l,$u)=(u($l,$u,$L),u($l,$u,$U));$u=o(256,$u-1),$l=o($l),$e=o(shift@e,$e)until($l^($u-1))>>24}$M{"@c"}{$h}++-++$C{"@c"}-pop@c for@p=($h,@c=@p);@p=@p[0..19]if@p>20;@c=@p;for(@p,$L=0){$c="@c";last if" "ne pop@c and@c<2 and$E>99;$m{$_}+=$M{$c}{$_}/$C{$c}for sort keys%{$M{$c}};$E+=$C{$c}}$s>5.393*$m{$_}or($s+=$m{$_},push@b,$_)for sort{$m{$b}<=>$m{$a}}sort keys%m;$e>=u($l,$u,$U=$L+$m{$_}/$s)?$L=$U:return$d=$_ for sort@b}

To run this program, you need this file here, which must be named B. (You can change this filename in the second instance of the character B above.) See below for how to generate this file.

The program uses a combination of Markov models essentially as in this answer by user2699, but with a few small modifications. This produces a distribution for the next character. We use information theory to decide whether to accept an error or spend bits of storage in B encoding hints (and if so, how). We use arithmetic coding to optimally store fractional bits from the model.

The program is 582 bytes long (including an unnecessary final newline) and the binary file B is 69942 bytes long, so under the rules for scoring multiple files, we score L as 582 + 69942 + 1 = 70525.

The program almost certainly requires a 64-bit (little-Endian?) architecture. It takes approximately 2.5 minutes to run on an m5.large instance on Amazon EC2.

Test code

# Golfed submission
require "submission.pl";

use strict; use warnings; use autodie;

# Scoring length of multiple files adds 1 penalty
my $length = (-s "submission.pl") + (-s "B") + 1;

# Read input
open my $IN, "<", "whale2.txt";
my $input = do { local $/; <$IN> };

# Run test harness
my $errors = 0;
for my $i ( 0 .. length($input)-1 ) {
    my $current = substr $input, $i, 1;
    my $decoded = g( $current );

    my $correct = substr $input, $i+1, 1;
    my $error_here = 0 + ($correct ne $decoded);
    $errors += $error_here;
}

# Output score
my $score = 2 * $length + $errors;
print <<EOF;
length $length
errors $errors
score  $score
EOF

The test harness assumes the submission is in the file submission.pl, but this can easily be changed in the second line.

Text comparison

"And did none of ye see it before?" cried Ahab, hailing the perched men all around him.\\"I saw him almost that same instant, sir, that Captain 
"And wid note of te fee bt seaore   cried Ahab, aasling the turshed aen inl atound him. \"' daw him wsoost thot some instant, wer, that Saptain 
"And _id no_e of _e _ee _t _e_ore__ cried Ahab, _a_ling the __r_hed _en __l a_ound him._\"_ _aw him ___ost th_t s_me instant, __r, that _aptain 

Ahab did, and I cried out," said Tashtego.\\"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I 
Ahab aid  ind I woued tut,  said tashtego, \"No, the same instant, tot the same -tow nhe woubloon ws mane. alte ieserved the seubloon ior te, I 
Ahab _id_ _nd I ___ed _ut,_ said _ashtego__\"No_ the same instant_ _ot the same_-_o_ _he _oubloon _s m_ne_ __te _eserved the __ubloon _or _e_ I 

only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows! There again!--there again!" he cr
gnly  towe of ye sould have tersed the shite Whale aisst  Ihere ihe blows! -there she blows! -there she blows! Ahere arains -mhere again!  ce cr
_nly_ _o_e of ye _ould have ___sed the _hite Whale _i_st_ _here _he blows!_-there she blows!_-there she blows! _here a_ain__-_here again!_ _e cr

This sample (chosen in another answer) occurs rather late in the text, so the model is quite developed by this point. Remember that the model is augmented by 70 kilobytes of "hints" that directly help it guess the characters; it is not driven simply by the short snippet of code above.

Generating hints

The following program accepts the exact submission code above (on standard input) and generates the exact B file above (on standard output):

@S=split"",join"",<>;eval join"",@S[0..15,64..122],'open W,"whale2.txt";($n,@W)=split"",join"",<W>;for$X(0..@W){($h,$n,%m,@b,$s,$E)=($n,$W[$X]);',@S[256..338],'U=0)',@S[343..522],'for(sort@b){$U=($L=$U)+$m{$_}/$s;if($_ eq$n)',@S[160..195],'X<128||print(pack C,$l>>24),',@S[195..217,235..255],'}}'

It takes approximately as long to run as the submission, as it performs similar computations.

A. Rex

Posted 2018-01-09T09:25:31.847

Reputation: 901

6Wow, impressive answer. I assume there is additional code needed to generate the B file? If so, please can you include that in your answer? – Nathaniel – 2018-01-15T00:21:04.327

6Excellent! The first (and so far the only) answer to break the 500k score barrier. – ShreevatsaR – 2018-01-16T03:27:36.517

1"tersed the shite whale" omg I'm crying – Phill – 2018-01-26T05:09:06.510

1Since no new answers were posted during the bounty period, I'm awarding it to your answer, as both the best scoring and the most sophisticated approach. If you ever have time, I would really appreciate a more in depth explanation of how this answer works, i.e. what exactly is the algorithm? – Nathaniel – 2018-02-08T09:26:03.643

70

Python 3, 2·267 + 510193 = 510727

Predictor

def p():
 d={};s=b''
 while 1:
  p={0:1};r=range(len(s)+1)
  for i in r:
   for c,n in d.setdefault(s[:i],{}).items():p[c]=p.get(c,1)*n**b'\1\6\f\36AcWuvY_v`\270~\333~'[i]
  c=yield max(sorted(p),key=p.get)
  for i in r:e=d[s[:i]];e[c]=e.get(c,1)+1
  s=b'%c'%c+s[:15]

This uses a weighted Bayesian combination of the order 0, …, 16 Markov models, with weights [1, 6, 12, 30, 65, 99, 87, 117, 118, 89, 95, 118, 96, 184, 126, 219, 126].

The result is not very sensitive to the selection of these weights, but I optimized them because I could, using the same late acceptance hill-climbing algorithm that I used in my answer to “Put together a Senate majority”, where each candidate mutation is just a ±1 increment to a single weight.

Test code

with open('whale2.txt', 'rb') as f:
    g = p()
    wrong = 0
    a = next(g)
    for b in f.read():
        wrong += a != b
        a = g.send(b)
    print(wrong)

Anders Kaseorg

Posted 2018-01-09T09:25:31.847

Reputation: 21 820

2The right tool for the job. Great score. Nice one. – agtoever – 2018-01-10T22:38:19.917

1Possible clarification: b"\0\3\6\r\34'&amp;-20'\22!P\n[\26" is the ascii representation of the weights, where small non-printable values are escaped in octal. – Cœur – 2018-01-11T05:35:41.180

I've updated the question with a version of the file where the text is not wrapped - you might try re-running your code on that (it might do slightly better) – Nathaniel – 2018-01-11T09:41:35.087

I tried to run it on TIO, but it doesn't work. Do you know what went wrong? – user202729 – 2018-01-11T14:04:54.617

@user202729 You need sys.stdin.buffer. – Dennis – 2018-01-11T14:06:43.150

@Nathaniel Yeah, it does better on whale2.txt by over 22000. – Anders Kaseorg – 2018-01-11T17:53:59.277

By the way, how are the weights chosen? – Nathaniel – 2018-01-11T22:55:17.597

@Nathaniel I tried a few different things before settling on the same late acceptance hill-climbing algorithm that I used in my answer to “Put together a Senate majority”, where each candidate mutation is just a ±1 increment to a single weight.

– Anders Kaseorg – 2018-01-11T23:00:42.363

Thanks for the pointer to late acceptance hill-climbing (LAHC), which is an interesting technique that I didn't know about until now. What text did you use to optimise the weights, given that you were unable to use whale.txt or whale2.txt by the conditions of the puzzle? – Simon – 2018-01-12T01:01:50.193

@Simon There is no such condition (the weights are fully included in the byte count of the predictor). – Anders Kaseorg – 2018-01-12T01:53:41.680

3

Thanks for that explanation - if you could edit a synopsis into the question it would be great. (The experience with my previous challenge Paint Starry Night is that these optimisation procedures are the most interesting part of the answers, so it's much better if answers include the code used to do that, and an explanation of it. I included a rule in both challenges saying that they should.)

– Nathaniel – 2018-01-12T04:44:03.503

Must s really be bytes, can't it be str (saving the b prefixes)? Or a tuple, initialized with s=() and updated with s=c,*s[:15]. – Stefan Pochmann – 2018-01-16T22:22:05.043

@StefanPochmann bytes is a memory efficiency hack—this already uses 3 GB of memory, and while it would only be 5% worse with str or 30% worse with tuple, I’m happy at this stage to spend a handful of bytes from the sixth significant digit of my score to make the code a bit less painful to run. – Anders Kaseorg – 2018-01-18T00:58:59.887

I don't speak Python but if I understood correctly then you do a linear mixing of the models which basically calculates a weighted arithmetic average of the models. I suggest switching to a weighted geometric mean. That's the main difference between the first PAQ versions and the current ones and greatly improved prediction for several reasons. See section "Adaptive model weighting" vs "Neural-network mixing". Neural network mixing is actually a single layer neural network with softmax activation trained on cross entropy simplified to binary.

– Christoph – 2018-01-18T08:11:24.047

1@Christoph My model combination actually is a weighted geometric mean. But PAQ’s averaging in the logistic domain is slightly different—I’ll have to see if that’s better. – Anders Kaseorg – 2018-01-18T08:31:56.077

52

Python 3, 2*279+592920=593478 2*250 + 592467 = 592967 2 * 271 + 592084 = 592626 2 * 278 + 592059 = 592615 2 * 285 + 586660 = 587230 2 * 320 + 585161 = 585801 2 * 339 + 585050 = 585728

d=m={}
s=1
w,v='',0
def f(c):
 global w,m,v,s,d
 if w not in m:m[w]={}
 u=m[w];u[c]=c in u and 1+u[c]or 1;v+=1;q=n=' ';w=w*s+c;s=c!=n
 if w in m:_,n=max((m[w][k],k)for k in m[w])
 elif s-1:n=d in'nedtfo'and't'or'a'
 elif'-'==c:n=c
 elif"'"==c:n='s'
 elif'/'<c<':':n='.'
 if v>4*(n!=q)+66:n='\n'
 if s:d=c
 if c<q:w=w[:-1]+q;v=s=0
 return n

Try it online!

A function using global variables. Learns as it goes, building a model at the word level: given what it's seen so far in this word, what's the most common next character? As more input comes in it learns common words from the text pretty well, and it also learns the most common character to start the next word.

For example:

  • If what it's seen so far is 'Captai' it predicts an "n"
  • If it's "Captain" it predicts a space
  • If it's the start of a word, and the last word was "Captain" it predicts an 'A'
  • If the word so far is 'A' it predicts an 'h' (and then 'a' and 'b'; similarly for 'C').

It doesn't do very well at the start, but by the end there are large parts of actual words coming out. The fallback option is a space, and after a single space it's an "a", unless the preceding letter was one of "nedtfo", a digit, or a hyphen or apostrophe. It also aggressively predicts line breaks after 71 characters, or if a space was expected after 66. Both of those were just tuned to the data ("t" is far more common after a space, but has more often already been predicted, so "a" is a better guess outside those six special cases).

Learning what word pairs went together and preseeding the mapping turned out not to be worthwhile.


It ends up with text like this:

nl tneund his    I woi tis tnlost ahet toie tn tant  wod, ihet taptain Ahab ses
 snd t
oeed Sft   aoid thshtego    Io, fhe soie tn tant  tot the soie      ahe sewbtoon
swn tagd  aoths eatmved fhe sewbtoon wor ta  I sfey  aote of totsonld nive betse
d ahe
hate Whale iorst  Ihe e ioi beaos! -there soi beaos! -there soi beaos!

which corresponds to this part of the input:

all around him.

"I saw him almost that same instant, sir, that Captain Ahab did, and I cried out," said Tashtego.

"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows!

You can see where proper nouns in particular come out pretty well, but the ends of words are mostly right as well. When it's seen "dou" it expects "doubt", but once the "l" appears it gets "doubloon".

If you run it a second time with the same model it just built it immediately gets another 92k correct (51.7%->59.3%), but it's always just under 60% from the second iteration on.


The measuring code is in the TIO link, or here's a slightly better version:

total = 0
right = 0
with open('whale.txt') as fp:
    with open('guess.txt', 'w') as dest:
        for l in fp.readlines():
            for c in l:
                last = c
                if p == c: right += 1
                n = f(c)
                p = n
                total += 1
                dest.write(n)
                if total % 10000 == 0:
                    print('{} / {} E={}\r'.format(right, total, total-right), end='')
print('{} / {}: E={}'.format(right, total, total - right))

guess.txt has the guessed output at the end.

Michael Homer

Posted 2018-01-09T09:25:31.847

Reputation: 671

3This is an excellent approach! – Skyler – 2018-01-10T15:18:52.923

2too much <s></s> ;) – tfbninja – 2018-01-14T04:11:56.157

1+1 because this approach reminded me of the LZW compression algorithm. – Marcos – 2018-01-17T07:17:09.883

23

C++, score: 2*132 + 865821 = 866085

Thanks to @Quentin for saving 217 bytes!

int f(int c){return c-10?"t \n 2  sS \n  -  08........       huaoRooe oioaoheu thpih eEA \n   neo    enueee neue hteht e"[c-32]:10;}

A very simple solution that, given a character, just outputs the character that most frequently appears after the input character.

Verify the score with:

#include <iostream>
#include <fstream>

int f(int c);

int main()
{
    std::ifstream file;
    file.open("whale2.txt");

    if (!file.is_open())
        return 1;

    char p_ch, ch;
    file >> std::noskipws >> p_ch;
    int incorrect = 0;
    while (file >> std::noskipws >> ch)
    {
        if (f(p_ch) != ch)
            ++incorrect;
        p_ch = ch;
    }

    file.close();

    std::cout << incorrect;
}

Edit: Using whale2.txt gives a better score.

Steadybox

Posted 2018-01-09T09:25:31.847

Reputation: 13 488

5You can translate this array into a string literal and inline it directly in place of L to save a bunch of characters :) – Quentin – 2018-01-09T17:43:38.193

@Quentin Thanks! Now I'm wondering why I didn't think of that in the first place... – Steadybox – 2018-01-10T01:58:43.930

20

Python, 2*516 + 521122 = 522154

Algorithm:

Yet another python submission, this algorithm calculates the most likely next letter looking at sequences of length 1,...,l. The sum of probabilities is used, and there are a few tricks to get better results.

from collections import Counter as C, defaultdict as D
R,l=range,10
s,n='',[D(C) for _ in R(l+1)]
def A(c):
 global s;s+=c;
 if len(s)<=l:return ' '
 P=D(lambda:0)
 for L in R(1,l+1):
  w=''.join(s[-L-1:-1]);n[L][w].update([c]);w=''.join(s[-L:])
  try:
   q,z=n[L][w].most_common(1)[0];x=sum(list(n[L][w].values()))
  except IndexError:continue
  p=z/x
  if x<3:p*=1/(3-x)
  P[q]+=p
 if not P:return ' '
 return max(P.items(),key=lambda i:i[1])[0]
import this, codecs as d
[A(c) for c in d.decode(this.s, 'rot-13')]

Results:

Mostly gibberish, although you can see it picks up on the occasional phrase, such as "Father Mapple".

errors: 521122
TRAINING:
result:  tetlsnowleof the won -opes  aIther Mapple,woneltnsinkeap hsd   lnd the  thth a shoey,aeidorsbine ao
actual: ntal knobs of the man-ropes, Father Mapple cast a look upwards, and then with a truly sailor-like bu
FINAL:
result: mnd wnd round  ahe   ind tveryaonsracting th ards the sol ens-ike aeock tolblescn the sgis of thet t
actual: und and round, then, and ever contracting towards the button-like black bubble at the axis of that s

Test code:

Pretty simple, outputs some examples of the text at different points. Uses whale2.txt, as this avoids some extra logic to calculate newlines.

from minified import A

def score(predict, text):
    errors = 0
    newtext = []
    for i, (actual, current) in  enumerate(zip(text[1:], text[:-1])):
        next = predict(current)
        errors += (actual != next)
        newtext.append(next)
        if (i % (len(text) // 100) == 0):
            print ('.', end='', flush=True)
    return errors, ''.join(newtext)

t = open('whale2.txt')
text = t.read()
err2, text2 = score(A, text)
print('errors:', err2)
print("TRAINING:")
print(text2[100000:100100].replace('\n', '\\n'))
print(text1[100001:100101].replace('\n', '\\n'))
print("FINAL:")
print(text2[121400:1215500].replace('\n', '\\n'))
print(text[121401:1215501].replace('\n', '\\n'))

user2699

Posted 2018-01-09T09:25:31.847

Reputation: 424

3Welcome to the site! This is a fantastic first submission. :) – DJMcMayhem – 2018-01-11T19:01:35.567

@DJMcMayhem, Thanks for the welcome. I've enjoyed watching for awhile now, this is the first contest to catch my attention for an entry. – user2699 – 2018-01-11T20:01:57.883

19

C (gcc), 679787 652892

84 76 bytes, 679619 652740 incorrect guesses

p[128][128][128][128];a,b,c,d;g(h){p[a][b][c][d]=h;h=p[a=b][b=c][c=d][d=h];}

Try it online!

Update: ~27000 points off with updated file, 16 points (8 bytes) with a better-golfed function.

Explanation

The way this works is that as the code runs through the text, it memorizes the last character that terminated any given 4-character sequence, and returns that value. Somewhat similar to Arnauld's approach above, but relies on the inherent likelihood of two given 4-character sequences terminating the same way.

De-golfed:

p[128][128][128][128];
a,b,c,d;
g(h){
    p[a][b][c][d]=h; // Memorize the last character.
    h=p[a=b][b=c][c=d][d=h]; // Read the guess. We save several
                             // bytes with the assignments inside indices.
}

Rogem

Posted 2018-01-09T09:25:31.847

Reputation: 310

... TIO link is useless. So the function returns the value of the last assignment? – user202729 – 2018-01-10T08:33:17.753

Let me edit the answer with an explanation, then :) – Rogem – 2018-01-10T08:49:41.057

1@Rogem I've added a de-golfed version (which I did because I couldn't follow it either) - hopefully this doesn't bother you but please roll back if desired. – Adam Davis – 2018-01-10T15:49:19.677

@AdamDavis on most implementations of C, all global variables start at zero. It's undefined behavior, so it's only used in code-golf. – NieDzejkob – 2018-01-10T16:02:03.883

1@NieDzejkob Ah, you're right, thanks! "ANSI-C requires that all uninitialized static/global variables have to be initialized with 0." – Adam Davis – 2018-01-10T16:03:04.953

@AdamDavis woah, it's not UB? TIL! – NieDzejkob – 2018-01-10T16:10:14.017

@AdamDavis It's beautiful. – Rogem – 2018-01-10T18:32:52.843

15

sh+bzip2, 2*364106 = 728212

2*381249 + 0 = 762498

dd if=$0 bs=1 skip=49|bunzip2&exec cat>/dev/null

followed by the bzip2-compressed whale2.txt with the first byte missing

Ignores its input; outputs the correct answer. This provides a baseline on one end; daniero provides a baseline on the other end.

Builder script:

#!/bin/sh
if [ $# -ne 3 ]
then
    echo "Usage $0 gen.sh datafile output.sh"
    exit 1
fi

cat $1 > $3
dd ibs=1 if=$2 skip=1 | bzip2 -9 >> $3
chmod +x $3

I/O test harness (tcc; cut off first line for gcc). This test harness can be used by anybody on a suitable platform that submits a complete program that expects read/write I/O. It uses byte-at-a-time I/O to avoid cheating. Child program must flush output after every byte to avoid blocking.

#!/usr/bin/tcc -run
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char **argv)
{
    volatile int result;
    int readfd[2];
    int writefd[2];
    int cppid;
    int bytecount;
    char c1, c2, c3;
    if (argc != 2) {
        printf("write X approximately -- service host\n");
        printf("Usage: %s serviceprocessbinary < source.txt\n", argv[0]);
        return 1;
    }
    /* Start service process */
    if (pipe(readfd)) {
        perror("pipe()");
        return 3;
    }
    if (pipe(writefd)) {
        perror("pipe()");
        return 3;
    }
    result = 0;
    if (!(cppid = vfork())) {
        char *argtable[3];
        argtable[0] = argv[1];
        argtable[1] = NULL;
        dup2(readfd[0], 0);
        dup2(writefd[1], 1);
        close(readfd[1]);
        close(writefd[0]);
        close(readfd[0]);
        close(writefd[1]);
        execvp(argv[1], argtable);
        if (errno == ENOEXEC) {
            argtable[0] = "/bin/sh";
            argtable[1] = argv[1];
            argtable[2] = NULL;
            /* old standard -- what isn't an executable
             * can be exec'd as a /bin/sh script */
            execvp("/bin/sh", argtable);
            result = ENOEXEC;
        } else {
            result = errno;
        }
        _exit(3);
    } else if (cppid < 0) {
        perror("vfork()");
        return 3;
    }
    if (result) {
        errno = result;
        perror("execvp()");
        return 3;
    }
    close(readfd[0]);
    close(writefd[1]);
    /* check results */
    read(0, &c2, 1);
    bytecount = 1;
    errno = 0;
    while (read(0, &c1, 1) > 0) {
        write(readfd[1], &c2, 1);
        if (read(writefd[0], &c3, 1) <= 0) {
            printf("%d errors (%d bytes)\n", result, bytecount);
            if (errno == 0)
                fprintf(stderr, "pipe: unexpected EOF\n");
            else
                perror("pipe");
            return 3;
        }
        if (c3 != c1)
            ++result;
        c2 = c1;
        ++bytecount;
    }
    printf("%d errors (%d bytes)\n", result, bytecount);
    return 0;
}

Joshua

Posted 2018-01-09T09:25:31.847

Reputation: 2 281

Can you explain how the I/O works in this submission? – Nathaniel – 2018-01-09T22:43:09.763

@Nathaniel: You start the thing and send bytes to it. Or don't bother because you already see the answer in front of you. – Joshua – 2018-01-09T23:12:11.600

Well ok but how is this achieved by your code? I'm only asking because I'm curious. – Nathaniel – 2018-01-10T00:17:02.637

@Nathaniel: dd opens the file given by $0 and skips the first 37 characters; and the rest is sent to bunzip2. – Joshua – 2018-01-10T03:02:04.993

6I think what he's asking is: how does this not violate the but may not load any other external files, and your code may not access the whale.txt file in any way other than described above. clause? – Rogem – 2018-01-10T08:33:26.750

8@Rogem The compressed data is put after what is shown here, and the code access itself. – user202729 – 2018-01-10T08:33:48.513

skip=37 should probably be skip=39 (the length of the code plus the newline after it). What's the purpose of &amp;exec cat? – axiac – 2018-01-11T10:06:55.483

4The question says: "Your submission will be a program or function (etc.) that will be called or invoked multiple times. When it is called for the nth time it will be given the nth character of whale.txt or whale2.txt and it must output its guess for the (n+1)th character." -- How is this requirement accomplished? The code displays the whole text of whale.txt every time it is executed. – axiac – 2018-01-11T10:09:46.700

1@axiac "anything is fine, as long as your program always gives one byte of output before receiving its next byte of input." – user202729 – 2018-01-11T14:11:58.563

@axiac I guess to ignore the latter part. – user202729 – 2018-01-11T14:12:55.817

1I don't understand @user202729's answer to axiac's question. If it always gives the complete text, how do you know which bytes of the text to ignore? If we can choose to ignore or not-ignore output bytes at will we might as well have a program that just outputs "abcdefghijklmnopqrstuvwxyz" regardless of input. – Vincent – 2018-01-12T13:01:04.840

@Vincent ... ignore what? I don't understand your question. No, literally "anything is fine". The I/O format of this answer is different from other answers. It's possible to use a stream of characters as input, as long as for each byte you read, you output one byte. (KevinCruissen's suggestion, Java related) – user202729 – 2018-01-12T13:09:05.373

1Yes I was talking about the output. As I understand it, every time it is executed it outputs the whole text of Moby Dick. The first time it does this, we know we should only consider the first character of that output as the 'real' output and ignore the rest. The second time we know we should only look at the second character of the output etc. But how do we know? Or do I misread what the program does and does it really only output one byte each time? – Vincent – 2018-01-12T13:21:12.913

1@Vincent The I/O format is flexible, it doesn't force the program to output one byte each time, just that it must output at least one byte after read a byte. In this case the program read 0 byte and output about a million byte, 0 < 1000000 so it's fine. – user202729 – 2018-01-12T13:34:59.570

@axiac: Yeah looks like I typoed it on rekeying it in after testing. I'm not interested in maintaining at after the predictors are starting to get good. – Joshua – 2018-01-13T02:39:37.690

@Vincent I thought quite a bit about this at the time it was posted, and decided it's OK, I/O wise. If I wanted to, I could write a test suite that sends one byte to this program via STDIN (which it would ignore), gets one byte from it via STDOUT, and so on, which would satisfy the requirements. At least, I think that would work. Strictly speaking, such a test script should also be included in the answer, but since it's no longer competitive I think we can let that slide. – Nathaniel – 2018-01-13T03:55:53.770

@Nathaniel you wrote in the question: "Your submission will be a program or function (etc.) that will be called or invoked multiple times. (1215235 times to be exact.) When it is called for the n<sup>th</sup> time it will be given the n<sup>th</sup> character of whale.txt or whale2.txt and it must output its guess for the (n+1)<sup>th character.". This answer clearly does not follow this requirement. – axiac – 2018-01-14T17:50:56.957

@axiac: Here's the expected test harness. On writing it I found I had a bug to fix. – Joshua – 2018-01-14T20:15:57.573

5@axiac given the test harness, I'm happy to regard sending the program one byte from STDIN as "calling or invoking" it. The import thing is that the program returns one byte of output after each byte of input, which this one does actually do, when run through the test harness. As the question says, "anything is fine, as long as your program always gives one byte of output before receiving its next byte of input." – Nathaniel – 2018-01-15T00:12:13.120

I am actually surprised it saves so little... With a purely-textual file I would have expected about a x10 compression rate, not x4... – Itai – 2018-01-15T06:39:40.113

@sillyfly: Yea. I expected about x6. Shrug. – Joshua – 2018-01-15T16:13:33.473

Have you tried killing some entropy in the (compressed) file? If (say) replacing 'z' with ' ' saves more bytes than it costs in errors... (you have to save 0.5 bytes for every error you generate). In theory this would be best done by the compression algorithm being told the cost function and being permitted to pick the "wrong" value if it costs more than X bytes to put the right value in. – Yakk – 2018-01-17T22:39:02.073

@Yakk ... that involves some information theory... which is covered in @{A.Rex}'s solution. And we're not sure how the compression algorithm works, too. – user202729 – 2018-01-21T10:41:36.990

@user202729: The algorithm is approximately compressing words. Not really, but bzip2 compression rarely splits a word if the word occurs frequently. – Joshua – 2018-01-21T19:09:17.040

13

C#, 378*2 + 569279 = 570035

using System.Collections.Generic;using System.Linq;class P{Dictionary<string,Dictionary<char,int>>m=new
Dictionary<string,Dictionary<char,int>>();string b="";public char N(char
c){if(!m.ContainsKey(b))m[b]=new Dictionary<char,int>();if(!m[b].ContainsKey(c))m[b][c]=0;m[b][c]++;b+=c;if(b.Length>4)b=b.Remove(0,1);return
m.ContainsKey(b)?m[b].OrderBy(k=>k.Value).Last().Key:' ';}}

This approach uses a look-up table to learn the most common character that follows a given string. The keys of the look-up table have a maximum of 4 characters, so the function first updates the look-up table with the current character and then just checks which character is the most likely to happen after the 4 previous ones including the current one. If those 4 characters are not found in the look-up table, it prints a space.

This version uses the whale2.txt file, as it greatly improves the number of successful guesses.

Following is the code used to test the class:

using System;
using System.IO;
using System.Text;

public class Program
{
    public static void Main(string[] args)
    {
        var contents = File.OpenText("whale2.txt").ReadToEnd();
        var predictor = new P();

        var errors = 0;
        var generated = new StringBuilder();
        var guessed = new StringBuilder();
        for (var i = 0; i < contents.Length - 1; i++)
        {
            var predicted = predictor.N(contents[i]);
            generated.Append(predicted);
            if (contents[i + 1] == predicted)
                guessed.Append(predicted);
            else
            {
                guessed.Append('_');
                errors++;
            }
        }

        Console.WriteLine("Errors/total: {0}/{1}", errors, contents.Length);
        File.WriteAllText("predicted-whale.txt", generated.ToString());
        File.WriteAllText("guessed-whale.txt", guessed.ToString());

        Console.ReadKey();
    }
}

The code runs in barely 2 seconds. Just for the record, this is what I get when I modify the size of the keys of the look-up table (includes the results of a second run without resetting the model):

Size   Errors   Errors(2)
-------------------------
1      866162   865850
2      734762   731533
3      621019   604613
4      569279   515744
5      579446   454052
6      629829   396855
7      696912   335034
8      765346   271275
9      826821   210552
10     876471   158263

It would be interesting to know why a key size of 4 characters is the best choice in this algorithm.

Text comparison

Original:

"And did none of ye see it before?" cried Ahab, hailing the perched men all around him.

"I saw him almost that same instant, sir, that Captain Ahab did, and I cried out," said Tashtego.

"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows! There again!--there again!"

Recreated:

"Tnd tes note of to seamtn we ore  
sried thab  wedleng the srriead te  a l tneund tes  
"T day tim t lost shet toie tn tand  aor, ahet taptain thab sid  tnd t waued tnt   said teshtego  
"To, ahe shme tn tand  aot the shme whot nhe sewbteodsan tagd  althsteatnved the sewbteodsaor te, I hncy  aote of to sanld bave beised the shate Whale iorst  Bhe e ati boaos  -the   ati boaos  -the   ati boaos  the e anains -ahe   anains 

Guesses:

"_nd ___ no_e of __ se____ _e_ore____ried _hab_ ___l_ng the __r___d _e_ a_l ___und _____
"_ _a_ _im ___ost _h_t ___e _n_tan__ __r, _h_t _aptain _hab _id_ _nd _ ___ed __t__ said __shtego__
"_o_ _he s_me _n_tan__ _ot the s_me___o_ _he ___b__o____ _____ __t___e___ved the ___b__o___or _e_ I _n_y_ _o_e of __ ___ld _ave __ised the _h_te Whale __rst_ _he_e ___ b___s__-the__ ___ b___s__-the__ ___ b___s_ _he_e a_ain__-_he__ a_ain__

Change log

  • 569279 - changed to whale2.txt and thus removed the optimization.
  • 577366 - optimized with code that tried to guess when to return a line feed.
  • 590354 - original version.

Charlie

Posted 2018-01-09T09:25:31.847

Reputation: 5 531

4Thanks for showing the variance as you change key size and column threshold! – Jeremy Weirich – 2018-01-10T20:34:45.010

I've updated the question with a version of the file where the text is not wrapped - you can probably save some points by using that – Nathaniel – 2018-01-11T09:36:04.050

@Nathaniel it does indeed. I have updated the answer. – Charlie – 2018-01-11T10:07:41.120

You can save some bytes using var instead of declaring the types. – Ed T – 2018-01-12T17:41:23.143

@EdT note that the variables are class variables, and you cannot use "var" outside methods. – Charlie – 2018-01-12T19:26:21.483

1As the key size gets larger, the number of hits verses misses will go down and thus more spaces will be output when a shorter key might have guessed the correct character. As the key size gets smaller, the individual guesses are less accurate for segments that match. I suspect this is why a length of four is optimal. If you maintained keys of multiple lengths and used shorter matches when longer ones are not available, then I expect the hit rate (and thus the score) will be greatly improved at longer key lengths. – Jeffrey L Whitledge – 2018-01-15T22:51:18.393

12

Python 3, 879766

F=[[0]*123for _ in range(123)]
P=32
def f(C):global P;C=ord(C);F[P][C]+=1;P=C;return chr(max(enumerate(F[C]),key=lambda x:x[1])[0])

Try it online!


... The /// answer which prints a space get 10 upvotes, while my code can only get 3 ...

Explanation:

For each character, the program:

  • increase frequency[prev][char]
  • Find the character that appear the most times in frequency[char]
  • and output it.

  • Ungolfed code in TIO link, commented out.
  • The code is 131 bytes.
  • The code run on my machine reports:
879504 / 1215235
Time: 62.01348257784468

which have the total score

2*131 + 879504 = 879766

Because there is no way to upload a large file to TIO (except asking Dennis), the example run in the TIO link only runs the program for a small part of the text.

In comparison with the older answer, this one have 362 more incorrect characters, but the code is shorter by 255 bytes. The multiplier makes my submission have lower score.

user202729

Posted 2018-01-09T09:25:31.847

Reputation: 7 149

11

Java 7, 1995 characters, (1995*2+525158) 529148

Java sucks for small program sizes. Anyway, I tried several extremely complex and tricky approaches that produced surprisingly crap results. I subsequently went back and just did a simple approach, which resulted in a smaller program size and better results.

This approach is actually extremely simple. It blindly feeds the previous x characters (in addition to all substrings of those characters) into a hashtable, mapped to the current character. It then keeps track of which patterns most accurately predict the current character. If patterns that precede certain characters are encountered multiple times, they succeed in predicting the character. It gives precedence to longer strings and it gives precedence to whichever character most often follows a given string. This algorithm knows nothing about the type of document or the english language.

I settled on using 9 characters and attempting to match whole words within those previous 9 characters when possible. When you don't try to do word matching within the strings, the optimum length is 6 characters, producing several thousand more mispredictions.

One interesting observation was that using 20 characters resulted in bad predictions the first time through but 99.9 percent accuracy on subsequent passes. The algorithm was basically able to memorize the book in overlapping 20 byte chunks, and this was distinct enough to allow it to recall the entire book a character at a time.

  • (1950*2+532919) 536819
  • (2406*2+526233) 531045 checking for punctuation to make better guesses
  • (1995*2+525158) 529148 more tweaking, golfed away some verbiage

package mobydick; import java.util.HashMap; public class BlindRankedPatternMatcher { String previousChars = ""; int FRAGLENGTH = 9; HashMap > patternPredictor = new HashMap<>(); void addWordInfo(String key, String prediction) { HashMap predictions = patternPredictor.get(key); if (predictions == null) { predictions = new HashMap(); patternPredictor.put(key, predictions); } WordInfo info = predictions.get(prediction); if (info == null) { info = new WordInfo(prediction); predictions.put(prediction, info); } info.freq++; } String getTopGuess (String pattern) { if (patternPredictor.get(pattern) != null) { java.util.List predictions = new java.util.ArrayList<>(); predictions.addAll(patternPredictor.get(pattern).values()); java.util.Collections.sort(predictions); return predictions.get(0).word; } return null; 
} String mainGuess() { 
if (trimGuess(",") != null) return trimGuess(","); if (trimGuess(";") != null) return trimGuess(";"); 
if (trimGuess(":") != null) return trimGuess(":"); 
if (trimGuess(".") != null) return trimGuess("."); if (trimGuess("!") != null) return trimGuess("!"); if (trimGuess("?") != null) return trimGuess("?"); if (trimGuess(" ") != null) return trimGuess(" "); for (int x = 0;x< previousChars.length();x++) { String tg = getTopGuess(previousChars.substring(x)); if (tg != null) { return tg; } } return "\n"; } String trimGuess(String c) { if (previousChars.contains(c)) { 
String test = previousChars.substring(previousChars.indexOf(c)); return getTopGuess(test); } return null; } public String predictNext(String newChar) { if (previousChars.length() < FRAGLENGTH) { previousChars+= newChar; } else { for (int x = 0; x addWordInfo(previousChars.substring(x), newChar); } previousChars = previousChars.substring(1) + newChar; } return mainGuess(); 
} class WordInfo implements Comparable { public WordInfo (String text) { this.word = text; } 
String word; int freq = 0; @Override public int compareTo(WordInfo arg0) { return Integer.compare(arg0.freq, this.freq); }

Jim W

Posted 2018-01-09T09:25:31.847

Reputation: 211

That's a pretty good score for such a verbose language. – DJMcMayhem – 2018-01-11T22:35:55.703

Thanks man. Appreciate the edit. – Jim W – 2018-01-11T22:36:02.913

1I figured it was worth a shot since the size of the file gives lots of room for improvement compared to program size. – Jim W – 2018-01-11T22:36:41.020

3This is not compilable under Java 7 (or any Java version, for what it's worth). Would you please fix your code? Once that's done, I'll gladly golf it so that your score is improved. – Olivier Grégoire – 2018-01-16T15:39:27.070

Untested, but this should be your exact same code slightly golfed: 950 bytes. Your current code contained quite a few errors though, so I'm not sure if I correctly filled in everything. Again, untested, so just compare the versions to see what I've changed/renamed, and see if everything still works the same as your original code. Can definitely be golfed some more, though.

– Kevin Cruijssen – 2018-02-01T14:06:10.100

Crap, I did this while bored at my old job and didn't take the code with me. I'll have to take a look at it to see where the typo is. – Jim W – 2018-02-25T03:06:12.790

10

Python 3, 2×497+619608=620602 2×496+619608=620600

import operator as o
l=''
w=''
d={}
p={}
s=0
def z(x,y):
 return sorted([(k,v) for k,v in x.items() if k.startswith(y)],key=o.itemgetter(1))
def f(c):
 global l,w,d,p,s
 r=' '
 if c in' \n':
  s+=1
  if w in d:d[w]+=1
  else:d[w]=1
  if w:
   if l:
    t=l+' '+w
    if t in p:p[t]+=1
    else:p[t]=1
   n=z(p,w+' ')
   if n:g=n[-1];l=w;w='';r=g[0][len(l)+1]
   else:l=w;w='';r='t'
 else:
  w=w+c;m=z(p,w)
  if m:
   g=m[-1]
   if g[0]==w:
    if s>12:s=0;r='\n'
   else:r=g[0][len(w)]
 return r

I attempted this independently, but ended up with what's effectively an inferior version of Michael Homer's answer. I hope that doesn't render my answer completely obsolete.

This builds over time a dictionary of words (crudely defined as strings terminated by or \n, case-sensitive and including punctuation). It then searches this dictionary for words beginning with what it so far knows of the current word, sorts the resulting list by frequency of occurrence (slowly), and guesses that the next character is the next character in the most common matching word. If we already have the most common matching word, or no longer matching word exists, it returns .

It also builds up a disgustingly inefficient dictionary of word pairs. Upon hitting a word boundary, it guesses that the next character is the first letter of the second word in the most common matching word pair, or t if there is no match. It's not very clever, though. Following Moby, the program correctly guesses that the next character is D, but then it forgets all about the context and usually ends up calling the whale "Moby Duck" (because the word "Dutch" seems to be more frequent in the first half of the text). It would be easy to fix this by prioritising word pairs over individual words, but I expect the gain would be marginal (since it's usually correct from the third character on, and the word pairs aren't that helpful in the first place).

I could tune this to better match the provided text, but I don't think manually tuning the algorithm based on prior knowledge of the input is really in the spirit of the game so, other than choosing t as the fallback character after a space (and I probably shouldn't have done that either), I avoided that. I ignored the known line length of the input file, and instead inserted \n after every 13 spaces—this is almost certainly a very poor match, the main intention was to keep the line length reasonable rather than to match the input.

The code isn't exactly fast (~2 hours on my machine), but overall gets about half the characters right (49%). I expect the score would be marginally better if run on whale2.txt, but I haven't done that.

The start of the output looks like this:

T t t t t t t t t L t t t tsher t t t ty t to t t te t t t t t tem t t t d b ta tnL te t tv tath a to tr t tl t l toe g to tf ahe gi te we th austitam ofd laammars, tn te to t tis nf tim oic t t th tn cindkth ae tf t d bh ao toe tr ai tat tnLiat tn to ay to tn hf to tex tfr toe tn toe kex te tia t l t l ti toe ke tf hhe kirl tou tu the tiach an taw th t t Wh tc t d t te the tnd tn tate tl te tf teu tl tn oan. HeAL. tn nn tf r t-H ta t WhALE.... S tn nort ts tlom rhe ka tnd Dr t t tALL th teuli th tis t-H taCTIONARY " t r t o t a t A t . t eALT t I t HLW t I t e t w t AO t t t AOLE, I T t t t ALE t w t t R t EK t T t R tSupplied by wnLw t t iit ty cce thet whe to tal ty tnd

but by the end, it looks a bit more like... something. My favourite passage from near the end of the book,

"I turn my body from the sun. What ho, Tashtego! let me hear thy hammer. Oh! ye three unsurrendered spires of mine; thou uncracked keel; and only god-bullied hull; thou firm deck, and haughty helm, and Pole-pointed prow,--death-glorious ship! must ye then perish, and without me? Am I cut off from the last fond pride of meanest shipwrecked captains? Oh, lonely death on lonely life! Oh, now I feel my topmost greatness lies in my topmost grief. Ho, ho! from all your furthest bounds, pour ye now in, ye bold billows of my whole foregone life, and top this one piled comber of my death! Towards thee I roll, thou all-destroying but unconquering whale; to the last I grapple with thee; from hell's heart I stab at thee; for hate's sake I spit my last breath at thee. Sink all coffins and all hearses to one common pool! and since neither can be mine, let me then tow to pieces, while still chasing thee, though tied to thee, thou damned whale! THUS, I give up the spear!"

comes out as

I dhrnery oyay ooom the woc Ihal iiw chshtego -tit my ti ddohe bidmer Hh, ho sheee opdeprendera toetis of tygd ahesgapdo tnep tnd tf y arosl tinl ahesgaorsltoak, and tidlhty ai p, cnd telas taep toip syst ho she tachlhe tnd tith ut ay Rnet hor bf toom the wist tord oaeve of ty nsst toip recked,hontain th, tingly toadh af tingly tike 'h, tot a hoet ty oh ost sreat ess iik in ty oh ost sremf Hew hiw"aoom tnl tou oolthert tyand . taoneoo sot an ao syad tytlows of ty oii e oor hoi tike and th ohes if oaped uoueid tf ty ooadh Ih ards the t houle lhesganl p tyt tpdomsuera tiile ah the wist t hrenelidtith the Ioom ti p s di dd o hoinbtn the Ior tid toie o hoetefy oist tyoakh on the Opr tnl toufin and tnl ti dd .mh tf ooueon gaor tnd todce tovther lon by tygd ait my the th aih tapce ciice toill moaneng she thesgh thmd th the thesgaoy d jiile YhE t hrve tpothe woerk "

That would have made The Wrath of Khan a lot more confusing. And "lonely" → "tingly" is a particularly satisfying substitution.

Edit: Saved one byte by deleting an extraneous space

Scoring

#! /usr/bin/env python3
import sys
import os
import mobydick as moby


def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

total = 0
right = 0
real_char = ''
guess_char = 'T'
print('T',end='')
with open("whale.txt") as whale:
    while True:
        if real_char == guess_char:
            right += 1
        real_char = whale.read(1)
        if not real_char:
            eprint(str(right) + " / " + str(total) + " (" +
                str(right/total*100) + "%)")
            size = os.path.getsize("mobydick.py")
            eprint("Source size: " + str(size) + "B")
            eprint("Score: " + str(2*size + total - right))
            sys.exit(0)
        guess_char = moby.f(real_char)
        print(guess_char,end='')
        total += 1

This runs the program for the text of Moby Dick and outputs the "predicted" textto stdout, and abuses stderr to write the score. I'd recommend redirecting the output to a file.

georgewatson

Posted 2018-01-09T09:25:31.847

Reputation: 231

2Welcome to PPCG! – Martin Ender – 2018-01-11T16:04:37.667

1Wouldn't lambda i:i[1] be cheaper than dealing with operator? – Draconis – 2018-01-11T17:18:45.063

@Draconis Almost certainly. – georgewatson – 2018-01-11T18:45:08.277

8+1 for "Moby Duck"! – Nathaniel – 2018-01-11T22:40:11.690

7

Python 3, 526640

274 bytes, 526092 errors (using whale2.txt). This is definitely capable of further improvement, but it has reached the "good enough to post" stage.

from collections import*
D=defaultdict
M=[D(lambda:D(int))for i in range(10)]
X=""
def f(c):
 global X;G=D(int)
 for L in range(10):
  M[L][X[:L]][c]+=1;N=M[L][(c+X)[:L]]
  if N:g=max(N,key=lambda k:(N[k],k));G[g]+=N[g]*L**8
 X=(c+X)[:10]
 return max(G,key=lambda k:(G[k],k))

The idea is to store the frequencies of all runs of 2, 3, 4, ..., 10 characters. For each of these lengths L, we check whether the most recent L-1 characters match a stored pattern; if so, our guess gL is the most frequent next character following that pattern. We collect up to nine guesses in this way. To decide which guess to use, we weight the frequency of each pattern by its length to the 8th power. The guess with the largest sum of weighted frequencies is chosen. If there are no patterns that match, we guess space.

(The maximum pattern length and the weighting exponent were chosen by trial-and-error to give the fewest incorrect guesses.)

Here's my ungolfed work-in-progress version:

from collections import defaultdict

PATTERN_MAX_LEN = 10
prev_chars = ""
patterns = [defaultdict(lambda:defaultdict(int))
            for i in range(PATTERN_MAX_LEN)]
# A pattern dictionary has entries like {" wh": {"i": 5, "a": 9}}

def next_char(c):
    global prev_chars
    guesses = defaultdict(int)
    for pattern_len in range(PATTERN_MAX_LEN):
        # Update patterns dictionary based on pattern and c
        pattern = prev_chars[:pattern_len]
        patterns[pattern_len][pattern][c] += 1
        # Make a guess at the next letter based on pattern (including c)
        pattern = (c + prev_chars)[:pattern_len]
        if pattern in patterns[pattern_len]:
            potential_next_chars = patterns[pattern_len][pattern]
            guess = max(potential_next_chars,
                        key=lambda k:(potential_next_chars[k], k))
            frequency = potential_next_chars[guess]
            # Exact formula TBD--long patterns need to be heavily
            # advantaged, but not too heavily
            weight = frequency * pattern_len ** 8
            guesses[guess] += weight
    # Update prev_chars with the current character
    prev_chars = (c + prev_chars)[:PATTERN_MAX_LEN]
    # Return the highest-weighted guess
    return max(guesses, key=lambda k:(guesses[k], k))

And the test harness:

from textPredictorGolfed import f as next_char
# OR:
# from textPredictor import next_char

total = 0
correct = 0
incorrect = 0

with open("whale2.txt") as file:
    character = file.read(1)
    while character != "":
        guess = next_char(character)
        character = file.read(1)
        if guess == character:
            correct += 1
        else:
            incorrect += 1
        total += 1

print("Errors:", incorrect, "({:.2f}%)".format(100 * incorrect / total))

Here's some sample output from near the beginning of the text. Already we begin to see the ability to finish common words after seeing their first letter (in, to, and, by; also, apparently, school).

 you take in hand to school others, and to teach them by what name a whale-fish
xU wshhlnrwn cindkgo dooool)tfhe -; wnd bo so rhoaoe ioy aienisotmhwnqiatl t n 

Near the end, there are still plenty of mistakes, but also plenty of very good sequences (shmage seashawks, for instance).

savage sea-hawks sailed with sheathed beaks. On the second day, a sail drew near
shmage seashawks wtidod oith tua dh   tyfr.  Tn the shaond tay, wnltiloloaa niar

It's interesting to look at some of the mistakes and guess what word the algorithm "expected." For instance, after sail, the program both times predicts o--for sailor, I presume. Or again, after , a it expects n--possibly because of the common occurrence of , and.


Changelog:

  • 274 * 2 + 526092 = 526640 Golfed the algorithm, at the cost of a few extra errors
  • 306 * 2 + 526089 = 526701 Original version

DLosc

Posted 2018-01-09T09:25:31.847

Reputation: 17 064

5

Python 2, score: 2*(407+56574) + 562262 = 676224

Searches for words matching the previous characters from a list of  all  most words used in the text, sorted by the number of their occurrences.

Code:

import zlib
f=open("d","rb")
l=zlib.decompress(f.read()).split()
w=""
def f(c):
 global w
 if c.isalpha():
  w+=c
  try:n=next(x for x in l if x.startswith(w))
  except StopIteration:return" "
  if len(n)>len(w):
   return list(n)[len(w)]
  return" "
 w="";
 n=ord(c)
 if n>31:
  return list("t \n 2  sS \n  -  08........       huaoRooe oioaoheu thpih eEA \n   neo    enueee neue hteht e")[n-32]
 return"\n"

Data: https://www.dropbox.com/s/etmzi6i26lso8xj/d?dl=0

Test suite:

incorrect = 0

with open("whale2.txt") as file:
    p_ch = ch = file.read(1)
    while True:
        ch = file.read(1)
        if not ch:
            break
        f_ch = f(p_ch)
        if f_ch != ch:
            incorrect += 1
        p_ch = ch

print incorrect

Edit: Using whale2.txt gives a better score.

Steadybox

Posted 2018-01-09T09:25:31.847

Reputation: 13 488

5

C++ (GCC), 725 × 2 + 527076 = 528526

Yet another prefix-frequency submission. Run on whale2.txt, and get similar (slightly worse) score than others.

#import<bits/stdc++.h>
char*T="\n !\"$&'()*,-.0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz";
int I[124];std::string P(7,0);struct D{int V=0;std::array<int,81>X{{0}};};std::vector<D>L(1);D
init(){for(int i=81;i--;)I[T[i]]=i;}int
f(int c){P=P.substr(1)+(char)I[c];for(int i=7;i--;){int D=0;for(char
c:P.substr(i)){if(!L[D].X[c]){L[D].X[c]=L.size();L.push_back({});}D=L[D].X[c];}++L[D].V;}std::vector<int>C(81);for(int
i=81;i--;)C[i]=i;for(int
i=0;i<7;++i){int D=0;for(char c:P.substr(i)){D=L[D].X[c];if(!D)break;}if(!D)continue;int M=0;for(int
x:C)M=std::max(M,L[L[D].X[x]].V);C.erase(std::remove_if(C.begin(),C.end(),[&](int
x){return L[L[D].X[x]].V!=M;}),C.end());if(C.size()<2)break;}return T[C[0]];}

This one greedily find the longest string that starts with a suffix of the history, and if there are multiple candidates, tiebreak with shorter strings.

For example: If the last 7 characters are abcdefgh, and the string abcdefghi and abcdefghj appears with the largest frequency in all strings of the form abcdefgh*, the output will be either i or j, tiebreak with shorter suffixes (bcdefgh, cdefgh, ...).

For unknown reasons, anything more than 7 and my computer doesn't have enough RAM to run it. Even with 7, I need to close all web browsers to run it.


Testing code:

int main() {
    init(); 

    std::cout << "Start ---\n";
    std::time_t start = std::clock();

    std::ifstream file {"whale2.txt"};
    // std::ofstream file_guess {"whale_guess.txt"};
    std::ofstream file_diff {"whale_diff.txt"};
    if (!file.is_open()) {
        std::cout << "File doesn't exist\n";
        return 0;
    }

    char p_ch, ch;
    file >> std::noskipws >> p_ch;
    int incorrect = 0, total = 0;
    // file_diff << p_ch;

    int constexpr line_len = 80;
    std::string correct, guess_diff;
    correct += p_ch;
    guess_diff += '~';

    while (file >> ch) {
        char guess = f(p_ch);

        // file_guess << guess;
/*        if (guess != ch) {
            if (ch == '\n') {
                file_diff << "$";
            } else if (ch == ' ') {
                file_diff << '_';
            } else {
                file_diff << '~';
            }
        } else {
            file_diff << ch;
        }*/
        incorrect += (guess != ch);
        total += 1;
        p_ch = ch;

        if (guess == '\n') guess = '/';
        if (ch == '\n') ch = '/';
        correct += ch; guess_diff += (ch == guess ? ch == ' ' ? ' ' : '~' : guess);
        if (correct.length() == line_len) {
            file_diff << guess_diff << '\n' << correct << "\n\n";
            guess_diff.clear();
            correct.clear();
        }
    }

    file_diff << guess_diff << '\n' << correct << "\n\n";

    file.close();
    file_diff.close();

    std::cout << (std::clock() - start) 
    / double(CLOCKS_PER_SEC) << " seconds, "
    "score = " << incorrect << " / " << total << '\n';
}

Ungolfed:

size_t constexpr N = 7;

int constexpr NCHAR = 81;

std::array<int, NCHAR> const charset = {{
'\n', ' ', '!', '"', '$', '&', '\'', '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'
}}; // this actually contains a lot of information, may want to golf it
// (may take the idea of using AndersKaseorg's algorithm, late acceptance hill climbing)

std::array<int, 'z' + 1> const char_index = [](){
    std::array<int, 'z' + 1> char_index;
    for (size_t i = NCHAR; i --> 0;) 
        char_index[charset[i]] = i;
    return char_index;
}(); // IIFE ?

std::string past (N, 0); 
// modifying this may improve the score by a few units

struct node {
    int value = 0;
    std::array<size_t, NCHAR> child_index {{0}};
};
std::vector<node> node_pool (1); // root

int f(int c) {
    past = past.substr(1) + (char) char_index[c];

    for (size_t i = 0; i < N; ++i) {
        // add past.substr(i) to the string
        size_t node = 0;
        for (char c : past.substr(i)) {
            if (node_pool[node].child_index[c] == 0) {
                node_pool[node].child_index[c] = node_pool.size();
                node_pool.emplace_back();
            }
            node = node_pool[node].child_index[c];
        }
        assert(node != 0); // the substring is non-empty
        ++node_pool[node].value;
    }

    std::vector<size_t> candidates (NCHAR);
    std::iota(candidates.begin(), candidates.end(), 0);
    for (size_t i = 0; i < N; ++i) {
        size_t node = 0;
        for (char c : past.substr(i)) {
            node = node_pool[node].child_index[c];
            if (node == 0) break;
        }
        if (node == 0) continue;

        assert(node_pool[0].value == 0);
        int max_value = 0;
        for (size_t x : candidates)
            max_value = std::max(max_value, node_pool[node_pool[node].child_index[x]].value);

        candidates.erase(
            std::remove_if(candidates.begin(), candidates.end(), [&](size_t x){
                return node_pool[node_pool[node].child_index[x]].value != max_value;
            }), candidates.end()
        );

        if (candidates.size() == 1) 
            break;
    }

    return charset[candidates[0]];
}

Example output:

~ ~s  ta~ hard ts tt~~~~~~~ ~doam ~~ ar~ ~ i~~~ ~~~ ~he~~~~,a~ t~~~~ t~ ho~si~  
n--as his wont at intervals--stepped forth from the scuttle in which he leaned, 

~~~ thr~ ~~ t~~ crp~~~~~~~~ a~ wap~~~~~ a~eo~~ h~~ o~~ s~~~ or~~y~ ~  boog~e~~ t
and went to his pivot-hole, he suddenly thrust out his face fiercely, snuffing u

~ a~~ ~h~ ~n~ onitn~oi~~~~~~ ~~a~ ~ cewsoat~  a~ tae~~~~ ~e~~t~~ te~~ ouc~s~i~~ 
p the sea air as a sagacious ship's dog will, in drawing nigh to some barbarous 

ct as I~ iisk~~~~ ~~e~ tls~~~~ i~~~ ~~ soe~e Ae ~ ~~e~ tar~~~~~ trd~  ot ~ h~~~ 
isle. He declared that a whale must be near. Soon that peculiar odor, sometimes 

This one is near the end of the text. Most long words are predicted quite accurately (intervals, pivot-hole, distance)

 au t  tf weu~i~ aor~ mre~g~~~ m~t~~ ~~~  ~"NC~X~t~ti~  ~~n~ SNsh A FNECnSERTR O
 on as it rolled five thousand years ago./////Epilogue//"AND I ONLY AM ESCAPED A

NL~~,S~ ~HR~ yO~ -/s~n "~A~~ laeu~ta Vew~, S~e s~~  s~ ~ ain~ t~d ~t~ oirept~~ ~
LONE TO TELL THEE" Job.//The drama's done. Why then here does any one step forth

Upper-case doesn't seem good.

user202729

Posted 2018-01-09T09:25:31.847

Reputation: 7 149

Trie seems to be more memory-consuming than I expected... – user202729 – 2018-01-14T15:52:47.197

... and also harder to implement. – user202729 – 2018-01-14T15:54:05.297

4

Python 2, 756837

Uses something that might be Markov chains?

import zlib
a=eval(zlib.decompress('x\x9cM\x9cis\xda\xcc\xd2\x86\xff\x8a2\xf5\xd4\x81\xb8,\x977l\'\xf9\x90\x12 \x02f\x11G\x02c||*%@,a\x11a1\xe0S\xef\x7f\x7fC\x13\xf75\xdf\xda\xaaa4\xd3\xcb\xddw\xf7\x8c\xfc\xbf\xcc\x8f\xd7E\xe6\xab\x93if\xce\x9d\xcc\x8f\xefG\xd1\x11\xf1\x1b\xa2At\x8e\xa2\'\xe2\xc5Q\xfc,\xa2{\x14+"\x9e3\xf63b\x87\x9f\xb5\x8fb$b\xeb(\x96E\x8c\x18\x1b2\xb6{\x14/D\xfcq\x14\x03\x11}\xc6zG\xb1.b\xc0\xd3\x06\xcb\xa9\xf1\xb3\xcaQl\x88X>\x8a-\x11\xb7G1\x11q\x85\x98\x1c\xc5\x95\x88\xf1Q\xec\x89\x98\x1e\xc5\x81\x88\xa2\xb3X\xc4\x19\xe2\xe4(\xbe\x898\xd6\xc9F\xa8\xe4E\x16\x19\x8a\xc8r^|U\xc9\x8b\xc7\xd8\xfcQ\xf4\x8f\xe2\xbf\x1c\x06\xbc\xa8v6\xef\xba\xb2\x17V\xf6\x92\xe8r6\x07\x9d\xcc\x95EN\xe4\xe9FW\xb6\xd9\xea6M\xa2K\xdf\xact\x86\xf9\xc976Gy\xf2\xce\xef\x96G1\x15q\xf1\xf1\xd4\xcc3\xe6\x8f\xb8\x96\xdf}\xd27\xcf\x1d\x9da\x8e\x1f\xcd\xc5c\\\x11Q\xcf\xfc\x02Q\x9c\xe7\\\xd6\xbe;\x8acY\xe5\x8c\x17\xcfu9F\xc4\x83\xfc\x0c\x076\x0b\x1d;\xc7\x97\xe7_U\x9c\xacT\xfc\xc2\x1a\xbe\xb0\x06\x83\r7b\xd9\x85<\x9d\xe8\x86\xbe|Q\xff\xfc\xf2\xa0\xe2d\xa7?\xfbr\xc5\xbc\x97\x8c\xbd\xd1\xbd}\xb9f@\x8e\x01\xb7\x88\xf7\x88w*\xce\x13v1\xc1ZCv\x1c\xebz\xe7=]\xce\x1c\x9d\xcdg\xe8,U/\x98/\x18`\xed\xf8\x8d\xa7\xe21\'\x1bo\xd4,sk\x80\xb8\xc6L\xc45Oq\xa9M\xac\x9e8\xc7?k\xb8\x9fY\xe9\x80\x9a\x8c\x9d\x8a\x98\xea\xde\x8c\xcc\xbb\x94\xa7\x13\x06\xc8\xca\xfa"\x1e\x98\xa1\xa4\xe1R\xfb\xa1\xb1W+\xf2b\xc0\xa4\x96W\xac\xa8\x15\x10=\x8d\xd3ZC#\xb2F \xd7j\xccP\xd78\xadU\x8fbWD"\xbd\xd6Q\xb7\xaf\xb5\x98\x0cH\xac\x85\xfc\x0cH\xac5\x15(k\xdd\x8f\xa7\xa6&\xf1v\xfa\x19\x00Q\xc3\x7fkxuM\xe2\xad(\xa2D\xd6\xabX\xb6&\xfeyy\x14\x1d\xdc\xa4v\x8azY\xdbU\xa4P\xf9\xc4\xcc?\x0fj\x8d\x9f\x135\xf8O\xde\xf7\xd3Q?Ym\xf4\xe9\n\xefY\xe12\xab\x9d:\xc7\n`Y\xfd>\x8a[\x11\xf1\x88\xd5\x9a\xc9\xf6\xcc\x80#\xad\xde\xd5+W\x03\x9e\x12/\xab!\xf3\x8e\x98\x81xY\xf5\x18\xd0g2\xe2e5g\xb2\x05+\x13\x07\x9d\x8b8fCD\xd1j\xca\xcf,X]\x81X+\xb0i\xa5\x88\xf5\'\x1c\x14VW`\xe9\n\x84]\x19u\xaa\x15\x16X\x81\xb0+\x0c\xb7"\'\xbf.N\xab0\xa7?n\xd5\x13^\x179\xb5\xf9\xebB<\xe4\xe1$_[c\x04\xc3\x06\'\x99W\xbd.\xb2\x1ap\xaf\x8b\xb3\x8fy\xcc\x9fW\x19\xe6t\xacE\x18\x1d\xffoR\xf1\xeb\xa2k\xc9/\x96\xfc\x1fk\xfa\x96Z\xe7u\xd1VLx]<\xa9Q^\x17\x1dkL\xd3\x9a\xe7\xdfj\xe4\xd7Eh\x8d\x8fT\xc3\xaf\x8b\x9a5\xben\xc9\ru\xd2\xd7E\xa0\xf6}]\x94\xad1\x15k\x8b\x8f\xd6\xf8\xaa\xf5\xae\xa25\xde\xb7\xe6)Y\xe3\x7fX\xb2g\x8d\xc9[\xeb/(:\xfc[\xd4P9=>X?}\xb7\xe4\x8d\xa5\x92\xad5\xe5\x9b\xb5\x9c\x9d5Fbru\x92\x7f[\xaf]Y\xe3\xd7\x96\xdaf\xd6\x16\xe7\x1a\t\xaf\x8b\x85\xb5\x06\t\x96\xe1I\x1e[\xf3L\xac\xf5\xfc\xb2~;\xb5\x9e\x0f\xac\xf1\x12\xd7\xfb\x93<\xb4\xe6\x1fYk\x8e\xad\xdf\xf6\xac\xdf\xf6u\xfc\x80\x00\x19\x10A\x03\xdcz\xa0ac\x06\x84\xe3\x00>3 2\x07D\xe6\x80\xd8\x1e\x10\xdb\x03\xd8\xc8\xc0\x02\x82\x01\xb9w \xea\xd9\x89\x08\xee\x0c\xe6\xaa\xd8\x01\xba\x19L\xf9\x19\x9a\x1c\xa0\xc8\x01\x807\x00\xf0\x06hq\x00\xd9\x1d\xf4\xd0\x89\xa5\x9e\x985\x80\xb4\x837\xd6\x00\x82\x0f\xf0\xae\x01\x19y\x80\xaf\x0c@\xf0\xc1\xf2cCf\x87Vw\xe8o\x87Vw\x98h\x87]vXk\x07a\xdc\xa1\xf6\x1d\xba\xdea\x81K\x012aR\x977\x88\x97\no\x97W<\x85u]\n\x17;e\xceK(\xda%\xc4\xed\x12\x16x\t7\xdcYV\xbe\x94-I\xba\xbcd\xa3\x97\xec\xee\xf2\\W\xb1\xc3r;l\xb4\xc3r\xbb\xbe\xea}\xd7C\x14s\x9dt\t\xb5\xdb-\xd0\x04>\xb5#)\xed\xe0\xb5;\x12\xd8\x0e\x84\xd8Q8\xec0\xe2\x8e\xe4\xbc[2\x00?\xb9\xc4#\nl\xb3\x80\xe5\n\xa2\x12![\x05\x81G!\x1e\x05AP)\xed\n\x02\xac\x02\xfa\x85\x80\xa75\xc5\xba\x02t\xad  )\xc5l\x01jW\xe8"\x86\xbcB\xd0RrR\xa1\xc5+\x08\x9d\xc2X\xd5W \xbd\x17f\xba\xcd\x82\xa8Z\xd2N!Q\xf5\x15\xdeU}\x85\x83\xc6@a\xa5\x01U\x10\xa5\x9e\xd8\xee@\x9fN 4\x06,3#\xd5\xaf\x01\xc9\x0c$\xc5\x10\xa8\x13\xe0y\xb2\xd4\x1dO0\x96I\xd5\x16\x93\xadnh\x82\x85\xcc/f \x1f\x18\x06L\xc6\xba\x9c\t\xc8c\xc8\x17\x13j\x8c\xc9L}}\x92\xea\xd2\'\xe2\x88#\x11\xd9\xd0\x04\xaa5\xe9\xf1\xb3D]\xd9\x90\xce&#\xc6\x0e\xd9[\x11\x9d\xf9\xe8\x97dj\xc8\xa5\xc6\xd3\x080dRSP\xbb\x99\x1ac\xeb<%\xf3\x9b\x00\x9d\x91\xf7\ri\xdf<2/I\xdf\xc0Y\x0c\x94\xc5<1\x03\x84\xc5\xc0W\x0ct\xc5\x84,\x07\xb2b\xe0KO\xb2\xb7\x9ah\x07\xf43\xaf\x19uv\x039\x7f\x12MI\x1d\xf3$k/\xc8\x80\x0b\xc5.s\x06\xe6=\xc9\x9e\xa58\x99\xb8\xea\xd7\x13"yr\x81\xed\x01\xb7\x89\xbcN\xb2\xd9\xc4\xe8l\x7f\xcah\x85|\xc3:\x9fp\x89\'0\xefi\xa2\xa29\x81\xe9\xdf\x15\xa5j\xc7\xc9\xe9\xb9\xbc&Gc)\x87\xeb\xe6@\xe4\x1c8\x9d\xcb)\xde\xe6\xc0\xf4\x1cew\x8e\x04\x90#-\xe4.u\xc99RHN\x12\x8b$\xa1\x1cj\xc9\x01{9\xf8w\x19L*\xd3\xf2*S\xf5\x95\x9fxJ\xff\xac\xdcb\x00uc\xb9\x82\xd8`\x00Uj\xb9\xce\x0c@d\x19\x88,\x1f\xd4ve\xca\xb4\xf2\x04\x11RR\x8e\xd5\x1ce*\xab\xb2m\x992&-\x7fV\xfd\x94/\xac\x11(\xa8\xec\xaac\x95\xb5\x92\xfd\x13VZ\xdf\xfeG\xb4\xd2\x16Q;d&\xf3\xcd\xe8l\xaf\x19\xcb\xb52\xce\x87k\x99\x8c{\x14]\x11\xcf\xcd\xc7\x0b\x17$8\x8br.\x00\xbf\x05yqA\xb6\xb4\xe8\xec\x02\xb6v"\xb3\x12\x86\'\xaey\x12\xa1R\'\xa6y\x1aKM\xba@s\'\xea*\x00qb\xae\xa7\xa7{\x9e\x92N\x17$\x97/\x04\x96E\xd2-\x8enQ\xf4\x05I`AA\xbe \tX\xf4\x7f\xa1t\xcedv\xe6o\xf8\x98\xcc\x9b\xf9;\xc0d\xb6\xe6\xef6Mf\xf3\xa1T\x93Y#\xae\x18\xfb\xdb\xfc]\x8e\xc9,\x8d\xce{`\xc0\x88\xa7C\xf3Wg&\x93\x98\xbf+3\x7fx\xb6\xce\xdb?\x8a3\x11{\xcc\x1b36\xe5\xe9\xe2\x8fh2\xe6(\xce\x99a\xc6\x0c\x13\xf3\xd7\xf2&3f9\x1dv\xfc\xc4\xd3\x16O#\xdc\x08&\xba\xb8\xc0-\x9bFm\x01\x81]\x00\x88\x0b\xc3\xd8\xae\xbe\xe2T!\x9f\x94\xea\x1f\xc5\xbd\x88E\xb4S@\xcc\xb3M\xcf\xa8{~g\xde\x80\xf56\xf8Y\xfdc\xac\xc9\xd4\xcc_\xe72\x99\n\xda)\x7f\x8c\xcd|eo_\x1du\xb9\xaf\xf4\x1a\xbeZ\xe1\xfe\'Gj\xac\xd6\x8f\x1b\x15\xbdg\xea\x8e\xe6\x9c:\xd3\xd5\t\xfc:\xc8X\x07%\xea\xf0\xf7\xfa\xe9%\x1d\x91\xe9l\xd7\xc9\x12u\x89>\xe9\x82\xd7\x01\xab:\xb5G}\xc3\xc4+D"\xaa\x0e\x08\xd6i\xf6\xd5\x0b\x9a\x0e\xeb4\x06\xeb\x02\xa3\xc2\x1e\xeb5\x05\xad:8[o(\xce\xd6+\xec\xbe\xcd\xcf\x9a\ne\xf5\x88\xe5\x90\x0c\xce_9[X[\x95\xc3\x1aD]S\xca\xac\xd1\xd59f:G\xdb\xe7g\x0c \xf9\x9c\xd3\xeeYgu\x99k\xcc\xb1f\x865\xf6ZS\xf1\xae\xf1\xe7\xb5z\xb9Yg48\xce\x1f\xf4\x15\xdfu2\xf3\x9d\x01\xdfA\xec\xccwG\xcd\xbc\xc62k@kM\x07y\r\xc0\xad\xa98\xd6t\xdd\xd7\x18\x7f\r\xd6\xad\xa1\xab\xeb_\x8a\xcdk\xe0\x7f\r\xb5]\xc3\xf6\xd7\x00\xfd\x1a\xf8_\x93\x14\xd6}\x85\xdeu\x8f\xa7\xb4\xb9\xd7#\xd6\x0b\xd0\xaf\x81\xff55@H\xb9\x15&\xba\x86P&\x93f[\xc8\xca\xc2\xb1\xbe-\x94]\x08\xa7\x0e\xe1\x07!\xdd\xa0\xf0\tQ\xb8\x84\x90\xa3\xb0\xa9\x8e\x1dBAB(H\x88[\x86\xf4\xccC\x02&\xfc\xa1\x8e\x1dz\x1a0a^}<\xa49\x15R\xb0\x85\xb0\x91P\x02F\x90#\xa4\xb8\x0b\xe9\x99\x87\xd4\x84!\xce\x1e\x12\x02!\xbd\xd2\x10\x18\n\xc5\xa3\xaeD\xc4\x81C\xf1\xc4\xbc\x888{\x08\xf6\x84\xa7\x88\x93pH(e\x12J\x99$Us&\xd4\xd4\t\x0c5\xa1\r\x93L\x15\x91\x12|.I\xd4\xc8\t| !\xf3\'\x94\x7f\tT+\xe9+\x16$\x90\x8b\x84pI\xf6\x0c\xe0\xb0.\x81\xcd%DC\xb2C$\xf3\'\x84VB\x01\x99\x10\x86\tgf\xc9\xcf\xa3(\\7\x01,\x12t\x9d\xa0\xe0\x84\xfeY\x02\xedO\x80\x90\x84\x92$!\xc5$\xd8;\x01\xfd\x12L\x7fA\xa1\x92\x9c\x0c\'S\xec\xa1w\xfb\x89jjO3dO\t\xbf\'\xa8\xf7\xf0\xb4}\xac\x10\xb2O4\xf8\xf6\xa2\xebO"\x82<{\x94\xb6\xa7E\xb2\xdf\xaa\xc7\\\xd1\x1d\xdd\xa3\x93=\x9a\xda\x8b\xfe$\x87\xedE\x11R\xaf\xecU=f\x8f\xd2\xf6\xec~om\xf9\xeaR\xadqE=rE\xa3\xeb\x8a:\xe7\x8a:\xe7J\xea\x9c{\x11\xa9s\xae\xa8\x94\xae\x04\xc5\xafE$\xbf\\\xd1l\xbb\xa2_u\xc5\xe6\x8a\x12\xca\x82\xe7\xc5\x9a\xc6z\xb1\xae\xb8P$\xc0\x8b`H\xb1\xa8\x10Q\xf4\x15N\x8ad\xe5"\x80T\xa4<*\xb6\x15\xc7\x8a\x1c\xa0\x15#\x85\x93"\xed\x87\xe2D-[\x84P\x14c\x05\xd0"\xa7\x87\xc5\xad\x1a\xaeH\xfe)\x9e\xd4.(S\xb4\xb6\xac\xf64\xc5\x8cr\xb2"\x14\xa8\x88\xbb\x17\xf1\xe6\x8e\xaf\x88\xd4\xa1r\xefp\x9b\xa1C=\xd7\x81rt\xd0_\x87\xf6X\x87\xc2\xb7#\xbb\xff&"-\xafN\x131Q\x07\xed\xd01\xec\x80n\x1d\x1a\x82\x1d\x02\xaa\xa3\x8a0\x1d\xd0\xb6\xe3\xb02\xee\x85t\xb8\x17\xd2\xb1N\x1d;\xec~\xcb\x81\xdf/p\xeaZ\xbc2\'O\'\x1a\x1a\xbf\x12\xb5\xdc/Y\xb0T>\xbfR5\xd7\x1d\xfc\xe6\x8e\xe0\xba\xc3Dw\x04\xc9\x1d\xa5\xfc\x1dArG\xe8\xdc\x11$w9\x8d\x81;\t\x129\x0e\xbb\x93EJ\x82\xb9\xa3\x9dp\xf7E\xc3\xa1\xc5\xed\x8a;\xab\x81F\xeb\xbeb\xc5o\x05\x9dT@\xbd\n\xc0ZaG\x15vT\xc1\xa7*\n\xa1\xa6\x92\xf9(r2\x95g\xf4^\xe1\xeeH\xa5\xc9\xefH\xf7\x95\x10\xb1\xad\xc1S\xc1\xa9*O\xea>\x95\x8a\xee\xb9R\xd7\xf0\xabp\xdf\xa6\x12\xa8\x87V\xc4\x85\x7f\x88\xc8\x8d\x9dJ\x81\xc9\xf2\xea(\x15\xc8E\xa5\xc8\x80\x1f\xac\xa1\xc4S*\xe4\n9\xaaB\xa3\xb5B\xc2\xab\x08\xceK\xbb\xadB2\xaf\x88\xf7\x08\xa2WH\xe6\x15\x12Ae\xa4\xc8Q\xa1\xd7\x98\xa5\xb0\xce\xaeu\rY\x8a\xf0,\r\xd1,\xb6\xf7\xb0a\x16\x92\x90\x85\x82f9O\xce\x92\xad\xb2\x9c\xa8e\xa1$Y\xc8f\x96s\x80,\xa1\x9c\x85E\\\x8b\x01\xe4\xf8?\x0b\xad\xcc\x82\x0b\xd9H\x8d\x95m\xf26i;\n^g\xe9@e\xf1\x87lU\xed\x96-3\x96.h\x96r(+\xfe \x80\x9e\xad\xf1b\n\xaa,\x9d\xd8l\x81\x9fy\n\xb6\xd9\x92:W\x96\xcb\x1c\xd9"/\xf6\xd9\x85\xc4\xf71\xb1\x99\xe3!\xb3\xc6@jUT\x0b\xfbv\x13\xa7*\x9eL\xf8$\xa3\x89\xb4\x94PL1c\n\xb1I\xc9\xd1)Q\x99\xd2\x01H\x89\xeb\x94hO\xc9\xe7\xdf\xa8\xae\xbei\xae5\xdf\xa8\x98\xbeQ\xcb}\xb3\x96#\x9e"\x97`R|8\xc5SR\xf1\x1fa0)EP\xfa\x0b\x11\x0fL\xc7\x1a\x10)\xa7\x85)\xae\x9f\xd2\x92O!\xafi\x9f5\xd0\xbeOi\x87y\xa1z`\n7M\x0f\xea\xb8\xe9\x9e\xc9\xe0\xa6\xdf\xacb8%\x1b\xa7\xc4u\xca-\xa3\x14r\x9a\xc2\xc9R\x98Z\x83}6\xe8f6h&4\x92\x8f\xa7\xa6Erk\xf0\xe2\x06i\xb7\x81\xef7\xa08\r*\x9b\x06\xd7\x85\x1a\xa4\xf3\x06d\xa6Am\xd4\xa0\xbaj\xf8\xfc\xec\x07O\x9f\x11\xe1@\r\x9a\t\r\x88O\x03Do\xb4\x18@\x0f\xa2\x01\x8c7:\xec\xc2J\xd1\r\\\xbcA\xc9\xd4\xb0\xda\xb7\x0b\x92m\x03\x8e\xd3\x80\xb36,\x05\xe2\xee\x0bk\xe2\x93me\xff16\x88\x01\xdf\x18W\x8aa+1n\x17\xe3\xa2\xf1P\x8d\x14c\xe6x\xccX\\?\xc6\xf5c\xc2$&-\xc4\x80o\xbc\xd0\xe0\x89q\xaax\xc9\xdb\xc8<\xf1\x8a\xb1\xb0\x99\x18g\x8d9(\x8f\xa9\xbabJ\xb8\x983\xc0\x980\xb9\x82\xac,\x80\x8b\x05Zm\x9dTy#\xbf\x03|b(A\x0c:\xc5\x90\xf7\x98c\x9c\x18\xc3\xc4\xa0^\xcc;b\xe0+\xb6\x88\x8b\xebk`\xbb\x9c\xc0\xb9\x9c\xb5\xb9\x82\xda\x92O\\\xf1}I\x85.G\xb6n\x9e\xb1u\xc4\x1a?\xe3\xac\xcd%\xa6\\\xb2\x8c[\xe6gD\xa5\xfb\xc8+\xda\xea\x11.\'p.gm.w\x86\\\xce\xda\xdc&\xf3r\xd6\xe6\x86\xfa\xd4!\xc5\xba\x9c\xc09\xdc>q)\xf5]2\x8ck\r\xa0#\xe4\x12\x03.g\xba.\xa5\xbeK\xa9\xba\xd9\xf1\x94\xbb4.Wl\\b`\x83\x83\xba\xdc\xa3q9\xecp\xc5W\x85\x1a\xb9\x90\x95\r5\xb2\x8b\xaf\xba\xc4\x80\x0bww\xd7h\x12\xf6\xb5\xe1\xfe\xc2\x86\x1do\xe8vm8\xe1s9~\xdap\x14\xecr\xd8\xe1\xda\xa7K\x1b+s;\xd6\xd5f\x1a\xe0\xaev\xd33\x1bBf\x83;\xbbV\xf7\xd1u1.a\xe0f\x99\x98\x88\xd80`\xe3\xa2,x\xc0\x86H\xdb\x90\xd07\xf0\x80\r\x01\xea\xa0\xee\x11\x17\\G4\x17#\x16\x1c\xb1\x8d\x88P\x8ch]E\x16:G\xb24\xc92\x11\x0b\x8e\xe4\xcdB\x1a"\xbd\xc8o"\x80::\xe9\xb5$\xf2A\x8d\x13a\xf4\x88l\x1a\x01f\x11\x1d\xd7h\xc3\xd8\xa9*0\xa2=\x16QKF)K#\xcfG@r\x84\x0fF\x84D$\x81"\x146J\x18\x10)4DT\xb9Q\x07Q@@\xca\xeb\x88\xcb\xb7\x11\x17u#\x92{TV\x18\x89\xe8JF\xa0OTg\x00\xd9?\x82\xb7Fy\xe6\xf5\x18Ku3\xc4\x9eC\xac<\x14\xd3\xca\x9d\xcc!.3\xc4e\x86\xda\x1e3C<mH6\x1eb\xef!$q\x88\x07\x8f\xf0\x9e\xa1\x15GC\x02w\x08b\x0c\xe9h\r\xe9h\ri\xb6\x0fi\x97\x0ci\x9a\r\xb1\xcb\x10\xee8\x04\x94\x86\xdc\xe4\x1f\x02kC\xcd\xbbf\xc4\xe6\x1c\xa9\xb4\xa5\xfe>\xb0\xcf\x03\x9b;\xb0\xe5\x03\xfb<\xa0\xb4\x03\xaa<\xa0\xbf\x03\xaf8`\x81\x03v9\xa0\xa9\x11o\xbb\xa63p\xcd\xd5\xafk\xdag\x07K\xab\xd7\\\xfb\xbf&\x8b_\xd3r\xb8\xa6\xe5pM\x1b\xe1\x9a\x0e\xdc\xb5\xac]: \xd7\xec\xf3\xda\xda\'Z=PU\x1e\xe6\xfa\xb3\x03\x08y\xa0\xbds\xe0`\xe3@\xf7\xeb\x00\xf8\x1e\xc8<\x07\x0e+\x0e\xc0\xf7\x81\xabI\x07\xa0\xfe\xb0d\x06\xfc\xe8@\xff\xec\x00\xe8\x1d(\x93}\x0bz|\xd0\xcbg\xcb\xbe\x85o\xbe\xc2\x9e\xf1\x81/\x1f\x8b\xfb\xdc\x88\xf7Aa\x1f\x83\xfaX\xdc\xa7\x7f\xe1\x13\xcb~\xa0p\xe1K\xdcK\xe9\xea\x83\x11~Y\xd1\xc0\x87u\xf8\x12\xe1/"B\xea}>_\xf2\xa9b}j\x01\xbf\xc0\x0cy\x96\x0e\xd5\xf7\xa5\x00\x10\x92\xed\xbf\xf0bN{\xfc\x0e?\x83\xdf\xfb\x94\xf0>=\x1f\x9f\n\xc1\xa7\xe7\xe3\xd3"\xf1q\x19\x9f\xfbZ>\xc7L>W\xe3|\xf1\x08a\xbd\xbex\x84d.\x9fF\x84Oq\xe8\xe3S\xfe\x9e\xb7Au}\x9af>\xd0\xe3C@|r\x91\xbfd\x91\xe2i\xbfE\xa47\xf3|\xf2)1\xe73\x01\xf3\x8co<\x8b9\x9fE\xa4_\xf5La\xf6\x0c\xbd}~V\x13\xfd#\x88$\x14\xfa\x1f.\xc5?\x8b1\xa4)\xf1\x0c\xb3\x99Zh0\xe5lc\x8a\xafN9?\x9d\x02ISh\xfa\x94\xb5O\xc1\xa1)\xa11\xc5\x99\xa7\xc0\xd7\x14o\xbfg\x86{\x1a\xf6\xf7\xf4Y\xef\xef\xf4m\xf79]\xef=Pw\x0fN\xdd\x83^\xf7|\xe0t\x0f\xd2\xdd\x0bzIk\xf4\x1eL\x9bb\xfb)\x1f\xd5Ma\x86\xd3\xa1b\xc4\x14\xc0\x99\x02oS\xe0mJG\x7f\n\xeb\x9d\x92J\xa6P\x87)04\xe5\xb6\xea\x14\xef\x99\xc2d\xa6$\xb9)e\xd9c\xa0\x0e\xf1\xe8+L=J\xf8J[\xf3\x99\xf3\xd5GV\xf6(K\x17\xa2\xf2\x88C<ri\xf4\x11k>b\xa1,*1\x0c\xf8\xafM\x80?c\xf0\xcf\x18\xfc3\xa3?\xe3\x1c\x9f/x\xca\x8d\xa1\xcf\xa0\xe2\x92\x88Y\xa2\xaa%Lo\x89~\x96\x1bDBu\x89\xaa\x96\\D^\xd2\x96\xfcl/~I\xd5\xb4D-K\xd8\xe2\x12;/\xb1\xfe\x92\x84\xb5D\xc7K>\xbf\\b\xfd\x1b\xf2\xe7\xd2\x8a\xbf%j[\x12\x1cK\xd8\xc1\x92\xfe\xc5\x92P\\\xc2:\x96\x98i\x89\x8a\x97(\xfe\x86\xa7\x01c\x03W!\'\xb0\x06h\x88\x9b\x80,\x16\x80\x0c\x01\x9d\x95\xe0\xb4\r\xf1\xb6\x806_@\x9a\x0fh\xf3\x05c\x8d\xe6\x00\xfa\x15\xd0Y\t\xf8\x10"\xe0\x849\x80\xd6\x05 n@\xfb+ u\x07DR@\xc6\x0f$P\xaa"rn\x15\xd4\x11\xb9\x04\x10Ty\xca\xf5\xc5\xa0\xac0\x1cH\xd2\x14\n\x1d\x94\x18\xcb\xd7\xb2\x01\x07\x04A\x01M\xf1\xe1l\xe0\xf1TR\xa9\xa4\x82\xa0\xc3+\xc8\x94\x01\xb7\xc1\x03:\xdc\x01UE\x10\xaaO\x05Z`\x98\x1en\xd2\xe3\x10\xbb\x87\r{\xd8\xbb\x87\x9b\xf4\xf0\x8d\x1e\xde\xd5\x83\xfd\xf7\xbe2\x16\xaf\xed\xbd\x02v\xbd\x81Z\xa0\x07\\\xf6F\x0c\x80\x8f\xf7z\x0c\x00\x18{TZ=\x82\xab\x97j\x18\xf5\xc6LF \xf6h\x9f\xf56\n\x97=\xdc\xa4\xf7\xc6\xcap\xa9\x1e\x05F\x8f\xa6m\x0f\xe8\xb8\xb0Ab{\xfaC\xc0\xd3\xa13ra5)\xb7\x84\xf0\x05J\xbe@\xc9[\x14wA$]X7E/2\x1c\rl\xad\x1f2\xdd\x96\x8b}[\x8e\xd5\xb6\xd8w\x0b\xa6n\x7f\xf2\xbe\xba:\xcbE\x11\xd1G,!\xfe\x97=]p\'\xec\xa2\xa3\xe2\x16%m\x856\t\xff\xd9\nmz\x17\x91\x8b\x9c[\xda\x8d[\x94\xbf\xc5$\x17\t\xf3\x02\xf7[\x92\xc0\x16\x1e\xb8\x05S\xb6|c\xbe\xa5\'\xba\xe5\x90xK\x83uK\xf9\xb7\xa5\xed\xb5\xe5\xde\xfeVPI\x9aV\xdbX]hK\xf1\xb1\xed)\xae\xb5\x0e\xba\x9c\x16m/\xcf\xeaA\xb6V\xaa\x93{\x0b\xed[\xb4\x17Zd\x94\x16I\xb9ES\xb9\x05]\xf5\x08\xe3\x960\xedc\xef\xdbx\x1c\xc3\xb4\xba\x8a\t-\xb1\x91\x90\xf9\x96\x80\x86\xd4\x0b-\x81\x12\xa9\x17<q*\xb9l\xdd\x82t{\xe2T\xc2*[\xfc\xb3\x82\x16\xa7\x04-N\xc8Z\x94\x19\xad\no\xa3\xa0hq\x87\xbf\x05qm\t\xf4\xc9)\x96WPP\xf6\xf2\xac\xc1\xfa\x19q\xe2q\x19\xc3\x13\x0f\x15\xa6\xe3Uto\x1e\xb7\r<\xaa\x1e\x0f\x84\xf7X\xba\xc7\xb1c\xcb*\xde\xbc\xa6\xc6\xa2\x17\xb1`\xce\x19<\xa0\xd8\xa3\xc0\xf1:<}\xd2\xdd{\x94H\xde3O_P\x8f\xa3\x9e\xdf"j\xbd\xbeb\xa3\x07/\xf5\x06\n}\xde\x08\x91\xa3\x05\x0f\x14\xf4\xe8cyP\x97\x16\xf7\xe8<\xd0\xd5\xe3h\xc1#v<J\x19\x8f\xa3c\x8f\x98\xf4V,\x92\xf3\x04\x8f\x00\xf7 f\x1e\x9f\xe3y\xf4R=>\xfc\x1c1\xd6\xa1\x976\x82\xef\x8e\xacf$k\x18\x81\x0b\x0e\xa1\xec\xf0\xbd\xbeC#\xd9\xa1\xbd\xecp\x99\xd2Ag\x0e\xd9\xcb\xa1m=\x02\xdd\x1c(\xdc\x88\xb3\x9d\xd1P\xb53"\xd3\x8d\xe8D8\xb0\x15\x87\x96\xc2\x88;\x98\x0e-n\xc7R\t\xc7\xed#\x8c\xe5\xf0\xa5\xd1\x88\xa5\x8f\xc6\xea\x04\x0e\x07\xd5\x0e\x9f\x0c9\x1cn8|t\xe4p\x10\xe2p<\xe2\xf0\xb9\xaf\xc3\xd7\xc1\x0e\xdf\t9|S\xe4p\xce\xe1\xf0\xfd\x91\xc3\x99\x88\xc3\xb7J\x0e\xe7\'\x0e\xdf\t9\x9c]8|S\xe4p\xce\xe1p\xfa\xe1p&\xe2pR\xe2\xf0\xad\x92\xf3\xc2+\x9e\x99\x8c\xd3\x8f\x11\xe1\xe4H>\x94v\x80c\x14+\x1c>\xffv\xfe\xf5!\x1a\'ct\xb2\x7f\x8eO\xa5\xdf\xe7\xc8\x89\xb7\x90=\'\x8b\xc8\xb5\xbf\x11\xd5\x8fC\xfev\xa4B\x95km\x0eu\xab\xc3\xb7\xec\x8e\x94\xbbR\x04\x8f(\x84\x1c)w\x856;R\x04Ki<\x82\xaa9R\xcd~\x11\x91\nc\x04\x81\x1bY\xe9\xe7\x1d\xa2\xf5N\xbd\xf2N&z\xc7\xbb\xde\xb9d\xf8\x0e\x1f\x7f\x87\xa5\xbf\x13#\xef\xef\x1a\xb2\xef\x94`74\x9b\x1cB\xf6f\xa0;z\x87\xd3\xbc\xbb\xbc\xcd\xda\xdcZ\r\xf7\x0ef\xbe\x83\x99m\x0e|\x1c\xf0\xea\x86\n\xff\x06]\xdf\xd0#\xb8\xa1\xefyC\x8f\xe0\x86/\xacnh\x9d\xde\xd0P\xbd\xa1\xf7pC+\xe4\x86\xf5>nu\x17\x0eHZ\x12\xbf\x17\xe4/\xd1\xe5/\xd1\xfb/q\x03\xa9D7\xbeTR\xff,q\xd7\xa8D]R\xa23X\xe2\xba\x7f\tU\x97\xb0E\x89{\x0f%\x0c[\xe2\xf3\x84\x12Ek\x89\xa3\xe6\x92u ^\x82\xaf\x96\xc4\x02R\x14\x948\xed)\xb9\xcc\xc6\x8d\xbb.\xed\xc9.]\xcd\xae,X\x9a\x80]z\x16]v\xdf\xa5\x90\xea\xc2R\xba\xa2\xbfS\xce\xee\xd28\xee\xe2\xa0].\x83t\xed\xcfA\xce!K)\xd0|N\xa4u\t\x99\xae\xab\xf6\xe8\xe2\xa2]\x8b/t\xf5\x03a\xd3\xa5L\xeeBZ\xba\x14\x02c\x9e\xce\xa8|g\xe4\x92\x19\xb7\x07f\xe4\x92\x19]\x8bY_w:\xa3\xee\x98Q\x1f\xcd\xb8:2\x9b1\xc3\\\x83c\xcd\xe6f\x84\xf8\x0cE\xccH\xc53\x92\xf9\x0c\x7f\x9e\xe1V3R\xf1\x8c+\xd93:\xa63\x90\xe1\x9c/\xd8g\x00\x91\x99Q\xa2\xce0\xc1\x8c\xae\xc7\x8c\x18\x9f\x11_3\xac1\x03Zg\xd6\xe6P\xfb\x0c\x18\x9ea\x81\x07&{`\xb2\x07y\xb1$\x93\x87\x07\x9erq\xf2\xe1Zq\xfa\xe1F\x01\xf7\x81\xcd=\\\xf1\x14\xecx\x00Q\x1e\x04;$\x83<\x08\xa2H/\xb2\xea|\xc4\xb8\xa9\xe2GUb\xaaj9]\x95\x05W\xd9Q\xf5\xa4V\x89\xaaj\xacJ\xa9R\xefT\xb1x\x15\x86X%\xca\xab\x90\x8e*uK\xd5\xd7x\xaf\x12\xc3\xd5\x9a\x06n\x95\xb8\xac\x86\x8aUU\xae\xe5U\xb9\xb1Y\x85\x13\x9f\x91\xc4\xcf:\xfa\xe2\xb3\xa6\xae\xec\x0c\x1ap\x161\x00\xd2q\xc6\xbf$;\xcb\xeb\x80\xefv\xad~\x86{\x9cQ\r\x9f\xd9C.\xf1\x95\xdfh\xb6\x85\xf8\x9b\xff\xfe\xd2\xa4Q\xd0\xdc \xc2T\x9b\x07u\xdd&`\xd4\x14#\xc8\x19@\x13\xf6\xd9\x9c\xa8\xb75Sf\x00\x80\x9b\xdc\x82lF\xaa\xcd\xa6hH0\xbe\xd9A$\xa34\xf9\xf8\xb6\xd9U\xfcmr\xa2\xd3\xa4\xbejr7\xb2)\x8a\x95z\xb0I\x1ai\xd2\x15kr\x81\xac\xe9\xf06"\xa9\x89\xce\x9a\x94LM\xeb\xf8\xac\xcf\xc7\xab\xfd\x89j\xb5\xcfU\xa8>t\xa4\x0fI\xe9S\x15\xf4\xa9\xc9\xfb\x16HR\xe6\xf4\xb9\x98\xd1\x07\x7f\xfa`U\x1f\x04\xeb\x93\x9c\xfb\xd8\xb0\xbfa26\xd7\'\xab\xf5\xd9g\x1f|\xeaS\x9c\xf7\t\xcb>\xf0\xd3\xc7\xd1\xfaV\x8b\xe0\x8d\x1d\xbd\xd1s~#X\xdf\xf8\x94\xfc\x8d\xb5\xbf\xb1\xe07\xdd\xa7y\xcb\x18\xfd\x19k\xcfc\xf0<\xdfB\xe5\xa9\xb8\xf3T\xc6\xf9@a$O\xb8\xe7\xdb\xcc\x00\x8d\xc9\x13\xf9y\x02;O\xea\xcd\xd3\xe7\xcb\xe3\xd7y6\x94\xe7\x7ft\xe5\xe9\xd2\xe5\xe9\xe0\xe6\xb1\xe1F\x9b&&\x0fH\xe692\xcbc\x97\xbc\x85\x97yL\xd0fD\x1b\xf5\xb4\x15}3#,\xd7\xde\xe8z\\\x98q\x9b\xfbDm\xc9\xab\xc2\xfd\xda3\x1d\xdb\x06D7\xd6\xcf\xba\n\xa2m)S\xe4\x18\xb6M7\xb7\xcd1M\x9bo\xdf\xda(\xb8\r\x18\xb4\xeb\x1a\xa9m1\x9c\xb0\xc7\xb6\x18NZ\x1am\xba\x1bmxb\x9b\xeb\x9b\xed\xa2\x86r\xfb\x87"@\xdbS#\xb7i\xcc\xb4\xf3\x1a\xcac4\xf9\x89\x1c\xfd\xc9\xba\xaf4\xe6\x9e\xd3\'\x98\xd6\'2\xf3\'\xeb\xbf6|\x02\x9c\xc7\xf0\xe81\x86\x19c\xae\xb15\x96W\x8f9\x14\x19C%>\xd9\xf0>\xb6\x0fY\x80\xe41~5\x06\xd4\xc7\xc0\xc4\x98\x92b\x0cL\x8c\xe1Gc\xf8\xd1\x98o#\xc7\xf4\xa5\xc7\xb0\xea1\x1cm\x0c]\x1ds\x9bjLwaL\x95:\x86\xad\x8f\xb9\xc60\x16\xca(g\xdd\xe3\x01\x1b\x02\r7P\xc6[J\xa0[\xa11\xc2<n\xa1&\xb7P\x93[\xbe\xbc\xbd\xcd\xa99n\xf9\xc7\x11\xb7\x14Q\xb7\xfc\x93\x89[\x8a\xa8[Lw\xcbY\xee\x85e\xf2[<~\x04t\x8e\xfeZ\xf4\xff\xfe\x1f\xfa\xddI\x97'))
global t
t=' '
def f(k):
 global t
 r=a[t+k]if t+k in a else'e';t=k
 return r

Skyler

Posted 2018-01-09T09:25:31.847

Reputation: 617

1

Quick explanation: The zlib.decompress('...') evaluates to {'G?':' ', 'G;':' ','G"':' ',.......}, and a is a dictionary which maps from 2 character to 1 character. Basically 2-character variant of Steadybox's answer.

– user202729 – 2018-01-10T08:53:52.580

1As I can see, the literal is 17780 bytes. You can reduce it to 11619 chars by removing whitespaces in the decompressed content, which saves 12322 bytes. (if I counted correctly) Also... converting hex escape codes to actual raw characters may save even more bytes. – user202729 – 2018-01-10T14:18:18.957

How do I post something here if it is raw bytes? – Skyler – 2018-01-10T15:13:40.123

1xxd, hexdump, uuencode, or similar – Peter Taylor – 2018-01-10T15:29:02.517

@user202729 Just note that Python code cannot contain actual raw NUL bytes. – mbomb007 – 2018-01-26T21:03:24.287

4

Haskell, (1904 + 1621 + 208548 + 25646) * 2 + 371705 = 847143

{-# LANGUAGE FlexibleInstances, DeriveGeneric #-}

import Control.Arrow
import Control.Monad
import Control.Monad.Trans.State
import Data.List

import System.IO
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Char8 as BC8
import Data.Ord
import Data.Char
import Data.Monoid
import Data.Maybe (fromJust, catMaybes)
import Data.Function
import qualified Data.Map as Map

import Codec.Compression.Lzma

import Data.Flat

import GHC.Word

maxWordLen :: Integral n => n
maxWordLen = 20

wordSeqDictSize :: Integral n => n
wordSeqDictSize = 255

predict :: [Trie] -> Char -> State ([Either Char Int], String) Char
predict statDict c = do
   (nextChar:future, begunWord) <- get
   case nextChar of
     Left p -> do
       put (future, [])
       return p
     Right lw -> do
       let wpre = begunWord++[c]
       put (future, wpre)
       return $ trieLook (tail wpre) (case drop lw statDict of{(t:_)->t;_->Trie[]})

newtype Trie = Trie [(Char,Trie)] deriving (Show, Generic)
instance Flat Trie

trieLook :: String -> Trie -> Char
trieLook [] (Trie ((p,_):_)) = p
trieLook (c:cs) (Trie m)
 | Just t' <- lookup c m  = trieLook cs t'
trieLook _ _ = ' '

moby :: IO (String -> String)
moby = do
    approxWSeq <- BSL.unpack . decompress <$> BSL.readFile "wordsseq"
    Right fallbackTries <- unflat <$> BS.readFile "dicttries"
    seqWords <- read <$> readFile "seqwords"
    let rdict = Map.fromList $ zip [maxWordLen..wordSeqDictSize] seqWords
    return $ \orig ->
      let reconstructed = approxWSeq >>= \i
             -> if i<maxWordLen then let l = fromIntegral i+1
                                     in replicate l $ Right l
                                else Left <$> rdict Map.! i
      in (`evalState`(reconstructed, ""))
              $ mapM (predict fallbackTries) (' ':orig)

Example:

Call me Ishmael. Some years ago--never mind how long precisely--having
 ap  me ,nhmael.  Hme ?ears |ce--never  usd how long .aacesely--|ubing
little or no money in my purse, and nothing particular to interest me on
little or no ?ivey in my ?efse, and ,uwhing .hrticular to Bdaenest me on
shore, I thought I would sail about a little and see the watery part of
?neae, I thought I would  cfl about a little and see the |rkers part of
the world. It is a way I have of driving off the spleen and regulating
the world. It is a way I have of ,uiving off the |kli   and .ia       
the circulation. Whenever I find myself growing grim about the mouth;
the Ca         . B        I  rtd |yself ,haoing  eom about the ?ivlh;
whenever it is a damp, drizzly November in my soul; whenever I find
Baieever it is a  'mp, ,uiv    Bar      in my  cfl; Baieever I  rtd

Uses three pre-computed auxiliary files:

  • seqwords contains the 236 most common words.
  • wordsseq contains an LZMA-compressed sequece of these words, and for all words not among the 236 most common, the length.
  • dicttries contains, for each word-length, a decision tree that contains all remaining words. From these tries, entries are picked as we go.

This way, we achieve significantly lower error rate than all the other lossy schemes; unfortunately, the wordsseq file is still too big to be competitive.

Here's a completed version that creates the files and does the analysis:

depunct :: String -> [String]
depunct (p:l) = (p:take lm1 wordr) : depunct (drop lm1 wordr ++ srcr)
 where lm1 = maxWordLen-1
       (wordr, srcr) = (`span`l) $ if isAlpha p
                 then \c -> isLetter c || c=='\''
                 else not . isAlpha
depunct []=[]

mhead :: Monoid a => [a] -> a
mhead (h:_) = h
mhead [] = mempty

limit :: [Int] -> [Int]
limit = go 0
 where go z (n:l) | z<100 = n : go (z+n) l
       go _ l = take 1 l

packStr :: String -> Integer
packStr = go 0
 where go n [] = n
       go n (c:cs)
        | c>='a' && c<='z'  = go (28*n + fromIntegral
                                   (1 + fromEnum c - fromEnum 'a')) cs
        | otherwise         = go (28*n) cs


mkTrie :: [String] -> Trie
mkTrie [] = Trie []
mkTrie strs = Trie [ (c, mkTrie . filter (not . null) $ tail<$>l)
                   | l@((c:_):_) <- sortBy (comparing length)
                                  . groupBy ((==)`on`head)
                                  $ sortBy (comparing head) strs ]

mkTries :: [String] -> [Trie]
mkTries rsrc = [ mkTrie $ filter ((==l) . length) rsrc
               | l <- [0..maximum (length<$>rsrc)] ]

main :: IO ()
main = do
    orig <- readFile "whale.txt"
    let wordchopped = depunct orig
        dictRes
          = take 5000
          . map mhead
          . sortBy (comparing $ negate . length)
          . group . sort
          $ wordchopped
        dict = Map.fromList $ zip dictRes [maxWordLen..wordSeqDictSize]
        rdict = Map.fromList $ zip [maxWordLen..wordSeqDictSize] dictRes
        approxWSeq = [ case Map.lookup w dict of
                        Just i -> i
                        Nothing -> fromIntegral (length w - 1) :: Word8
                     | w <- wordchopped ]
        fallbackTries = mkTries . drop (wordSeqDictSize-maxWordLen) $ dictRes
        reconstructed = approxWSeq >>= \i
             -> if i<maxWordLen then let l = fromIntegral i+1
                                     in replicate l $ Right l
                                else Left <$> rdict Map.! i
        predicted = (`evalState`(reconstructed, ""))
              $ mapM (predict fallbackTries) (' ':orig)
        incorrects = length . filter id $ zipWith (/=) orig predicted
    putStrLn $ "longest word: "++show(maximum $ length<$>wordchopped)
    putStrLn $ show incorrects++" errors / "++show (length orig)++" chars"
    BSL.writeFile "wordsseq" . compress $ BSL.pack approxWSeq
    BS.writeFile "dicttries" $ flat fallbackTries
    writeFile "seqwords" . show $ take (256-maxWordLen) dictRes
    writeFile "whale-approx.txt" . unlines $ coLines orig predicted

coLines :: String -> String -> [String]
coLines [] _ = [[],[]]
coLines ('\n':l) (_:m) = []:[]:coLines l m
coLines l ('\n':m) = coLines l ('|':m)
coLines (c:l) (d:m) = case coLines l m of
   (lt:mt:r) -> (c:lt):(d:mt):r

ceased to turn counterclockwis

Posted 2018-01-09T09:25:31.847

Reputation: 5 000

3

C++ (WIP), 1923*2 + 1017344 = 1021190

#include <map>
#include <random>
#include <string>
#include <type_traits>
#include <vector>

using namespace std;

constexpr minstd_rand::result_type seed = 10087702;

template<typename T>
class discrete_mapped_distribution {
private:
    discrete_distribution<size_t> distr;
    vector<T> values;

public:
    discrete_mapped_distribution() :
            distr(), values() {
    }
    template<typename I, typename = typename enable_if<is_arithmetic<I>::value,
            I>::type>
    discrete_mapped_distribution(map<T, I> distribution) :
            values() {
        vector<I> counts;

        values.reserve(distribution.size());
        counts.reserve(distribution.size());

        for (typename map<T, I>::const_reference count : distribution) {
            values.push_back(count.first);
            counts.push_back(count.second);
        }

        distr = discrete_distribution<size_t>(counts.cbegin(), counts.cend());
    }

    discrete_mapped_distribution(const discrete_mapped_distribution&) = default;
    discrete_mapped_distribution& operator=(const discrete_mapped_distribution&) = default;

    template<typename URNG>
    T operator()(URNG& urng) {
        return values.at(distr(urng));
    }
};

class generator2 {
private:
    static map<char, discrete_mapped_distribution<char>> letters;

    minstd_rand rng;

public:
    static void initDistribution(const string& text) {
        map<char, map<char, uint64_t>> letterDistribution;

        string::const_iterator it = text.cbegin();
        char oldLetter = *it++;

        for (; it != text.cend();) {
            ++(letterDistribution[oldLetter][*it]);
            oldLetter = *it++;
        }

        generator2::letters = map<char, discrete_mapped_distribution<char>>();

        for (map<char, map<char, uint64_t>>::const_reference letter : letterDistribution) {
            generator2::letters[letter.first] = discrete_mapped_distribution<char>(letter.second);
        }
    }

    generator2() :
            rng(seed) {
    }

    char getNextChar(char in) {
        return letters.at(in)(rng);
    }
};

map<char, discrete_mapped_distribution<char>> generator2::letters;

As it stand this solution is WIP and therefore ungolfed. Also considering that the actual code size barely has any impact on the score I thought I post my answer first before starting to micro optimize it.
(Full code available here: https://github.com/BrainStone/MobyDickRNG - Includes full program and seed search)

This solution is based on a RNG. First I analyze the text. I create a map that counts the occurrences of two consecutive characters. Then I create a distribution map. This is all done statically so should be in accordance of the rules.

Then while trying to print the text I do a lookup and pull a random character of the possible ones. While this usually produces worse results than just outputting the most common following letter, there could are likely god seeds that will produce better results. That's why the seed is hard coded. I'm currently searching for the best seed. And I will update this answer once I find better seeds. So stay posted!

If anyone wants to search for seeds themselves or use different RNGs feel free to fork the repo.

Method used to calculate score: https://github.com/BrainStone/MobyDickRNG/blob/master/src/search.cpp#L15

Note even though the total score is the worst at the moment, it beats the error count of just outputting spaces. And the chances are good that the score will drop, by checking more seeds.

Changelog

  • 2018/01/24: Posted inital answer.
    Checked seeds: 0-50000. Score: 2305*2 + 1017754 = 1022364
  • 2018/01/24: Did some minimal golfing. Added link to score calculation method.
    Checked seeds: 0-80000. Score: 1920*2 + 1017754 = 1021594 (-770)
  • 2018/02/02: New seed (10087702) (didn't find the time to fix the submission)
    Checked seeds: 0-32000000. Score: 1923*2 + 1017344 = 1021190 (-404)

BrainStone

Posted 2018-01-09T09:25:31.847

Reputation: 929

Could you include a test harness in your answer that evaluates the score? – Nathaniel – 2018-01-24T10:58:58.980

@Nathaniel I linked the score code directly. In addition with the repository would you consider this enough? – BrainStone – 2018-01-24T11:10:44.120

Upon reviewing the rules I have noticed that I violate some of them. I naturally will update my answer once I fix the issues – BrainStone – 2018-01-24T12:08:04.693

You will then end up encoding the text into the random seed. See esoteric programming language Seed, and you may want to reverse engineer the MT19937 program and beat this answer (if you can).

– user202729 – 2018-01-24T13:14:49.903

Nice idea, but won't help getting a good score. +1 anyway. – user202729 – 2018-01-24T13:41:01.953

@user202729 my solution actually got inspired by seed. – BrainStone – 2018-01-24T14:21:01.823

@Nathaniel is my approach of analyzing the text during startup acceptable (I know I need to add line that actually initializes it) or would I need to hardcore the matrix? – BrainStone – 2018-01-24T14:22:49.903

@BrainStone I'm afraid I don't know what "analyzing the text during startup" means. The important thing is that your code cannot make multiple passes through the text - it can't have access to any character of the text before it makes a guess about that character. (This is why I asked for the score calculating code. If I can see that then it should be more clear what's actually happening. I did look for it in the git repo, but it's not clear to me which part of the code does what job. If you could make a self-contained piece of code for that task it would help.) – Nathaniel – 2018-01-25T00:40:53.633

@Nathaniel I see. So then I have to hardcode my map. Good to know. Shouldn’t be too much of an issue. – BrainStone – 2018-01-25T04:07:01.040

2

Python 3, (146*2+879757) 880049 bytes

def f(c):return"\n                     t \n 2  sS \n  -  08........       huaoRooe oioaohue thpih eEA \n   neo    enueee neue hteht e"[ord(c)-10]

Try it online!

Pretty straightforward frequency table. Each position in the string corresponds to the current character's ascii code (minus 10 = 0x0a = '\n', the lowest character in the file), and the character at each index is the most frequent next character. Assuming I calculated the frequencies right…

Tested with the code from user202729's test

Kevin

Posted 2018-01-09T09:25:31.847

Reputation: 471

Can you save some bytes by using def f(c):return(" "&gt;c)*c or"t ... e"[ord(c)-32]? – Neil – 2018-01-11T11:32:32.183