Textual Analysis

A statistical analysis of texts that could support claims of common authorship or common theme among documents.

Method

Using standard Unix utilities, each text is first divided into sentences. Then each sentence is broken into N-grams -- sequences of N tokens for varying values of N. N-grams do not span sentence boundaries. For example, the simple text:

	My car goes fast.  My dog is brown.

yields the 3-grams:

	my car goes
	car goes fast
	my dog is
	dog is brown
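
The splitting step above can be sketched in a few lines of Python. This is only an illustration, not the original Unix pipeline: the function names, the sentence-ending characters and the rule that a token is a lowercase run of letters and apostrophes are all assumptions.

```python
import re

def sentences(text):
    """Split text into sentences on '.', '!' or '?' (an assumed rule)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def tokens(sentence):
    """Lowercase the sentence and keep runs of letters and apostrophes (assumed)."""
    return re.findall(r"[a-z']+", sentence.lower())

def ngrams(text, n):
    """Return every n-gram in text; n-grams never span a sentence boundary."""
    grams = []
    for sentence in sentences(text):
        words = tokens(sentence)
        # A sentence with fewer than n tokens contributes no n-grams.
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

print(ngrams("My car goes fast.  My dog is brown.", 3))
# ['my car goes', 'car goes fast', 'my dog is', 'dog is brown']
```

Because each sentence is processed separately, "fast my dog" is never produced, which matches the example.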

Statistics are then collected for the N-gram set from each work, in comparison with the N-gram collection from the entire corpus. If one work has lower redundancy among its N-grams, it might be considered to use richer language at that scale, at least in comparison with other works of approximately the same overall size.
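
As a rough sketch, the two per-work statistics can be computed as follows. The formulas are an assumption: redundancy is taken to be the fraction of N-gram occurrences that repeat an N-gram already counted, and corpus share the fraction of the corpus's unique N-grams that occur in the work. Both agree with the 1-gram figures reported later for 0hfinn.txt. The function names are hypothetical.

```python
def redundancy(total, unique):
    """Assumed definition: fraction of n-gram occurrences that are repeats."""
    if total == 0:
        return 0.0
    return 1.0 - unique / total

def corpus_share(unique_in_work, unique_in_corpus):
    """Assumed definition: fraction of the corpus's unique n-grams found in the work."""
    return unique_in_work / unique_in_corpus

def redundancy_of(grams):
    """Convenience wrapper that counts total and unique n-grams from a list."""
    return redundancy(len(grams), len(set(grams)))

# 1-gram counts reported for 0hfinn.txt: 111451 total, 6836 unique,
# against 54056 unique 1-grams in the whole corpus.
print(f"{100 * redundancy(111451, 6836):.3f}% redundant")   # 93.866% redundant
print(f"{100 * corpus_share(6836, 54056):.3f}% of corpus")  # 12.646% of corpus
```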

The collection of unique N-grams for the entire corpus is used to calculate a "feature vector" for each work. If there are M unique N-grams in the entire corpus, then the feature vector for each text is a list of M floating-point values. The i-th value in a work's feature vector is the percentage of that work's N-grams that are identical to the i-th N-gram in the overall list.
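
A minimal Python sketch of the feature-vector construction follows. The function names and the toy file names are hypothetical, the ordering of the overall list (alphabetical here) is an assumption, and the values are stored as fractions between 0 and 1; multiply by 100 for percentages.

```python
from collections import Counter

def corpus_vocabulary(works):
    """Sorted list of the unique n-grams across all works (the M dimensions)."""
    vocabulary = set()
    for grams in works.values():
        vocabulary.update(grams)
    return sorted(vocabulary)

def feature_vector(grams, vocabulary):
    """For each corpus n-gram, the share of this work's n-grams equal to it."""
    counts = Counter(grams)
    total = len(grams)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[g] / total for g in vocabulary]

# Toy corpus of 2-grams; the file names are made up for illustration.
works = {
    "a.txt": ["my car", "car goes", "my car"],
    "b.txt": ["my dog", "my car"],
}
vocabulary = corpus_vocabulary(works)
for name, grams in works.items():
    print(name, feature_vector(grams, vocabulary))
```

With M in the millions, as for the 3-grams and 4-grams of this corpus, a dense list per work is wasteful; a dictionary holding only the non-zero entries would carry the same information.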

The "distance" between two works is then the Euclidean distance in that M-dimensional space. A distance of zero indicates that the two works have identical distributions of N-grams, and thus resemble each other in some statistical sense. It is tempting to say that a distance near zero indicates that the two texts are strongly correlated, but this is not necessarily true in general. This is, however, a useful qualitative measure -- works by the same author tend to have small inter-document distances.
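
The distance calculation itself is short. The sketch below assumes both feature vectors are plain Python lists of equal length M, built over the same corpus-wide N-gram list; the function name and the sample vectors are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of the same length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two toy feature vectors over a three-entry n-gram list.
u = [0.5, 0.5, 0.0]
v = [0.5, 0.0, 0.5]
print(euclidean_distance(u, u))  # 0.0, identical distributions
print(euclidean_distance(u, v))  # sqrt(0.5), about 0.707
```

The function is symmetric in its two arguments, so only one triangle of a distance table needs to be computed.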


Texts Used


Filename         Title and Author
-------------    ---------------------------------------------------

0hfinn.txt       "The Adventures of Huckleberry Finn", by Mark Twain
0lmiss.txt       "Life on the Mississippi", by Mark Twain
0tramp.txt       "A Tramp Abroad", by Mark Twain, 1880
0yankee.txt      "A Connecticut Yankee in King Arthur's Court", by Mark
		 Twain
1emma.txt        "Emma", by Jane Austen
1persua.txt      "Persuasion", by Jane Austen (1818)
1pride.txt       "Pride and Prejudice", by Jane Austen
1sense.txt       "Sense and Sensibility", by Jane Austen (1811)
2gmars.txt       "The Gods of Mars", by Edgar Rice Burroughs (1913)
2pmars.txt       "A Princess of Mars", by Edgar Rice Burroughs
2tarzan.txt      "Tarzan of the Apes", by Edgar Rice Burroughs
2timelf.txt      "The Land That Time Forgot", by Edgar Rice Burroughs
3alice.txt       "Alice's Adventures in Wonderland", by Lewis Carroll
3lglass.txt      "Through the Looking Glass", by Lewis Carroll
3snark.txt       "The Hunting of the Snark", by Lewis Carroll
4agent.txt       "The Secret Agent", by Joseph Conrad
4hdark.txt       "Heart of Darkness", by Joseph Conrad
4sshar.txt       "The Secret Sharer", by Joseph Conrad
5great.txt       "Great Expectations", by Charles Dickens
5oliver.txt      "Oliver Twist", by Charles Dickens
5pwprs.txt       "The Pickwick Papers", by Charles Dickens
5twocity.txt     "A Tale of Two Cities", by Charles Dickens
6callw.txt       "The Call of the Wild", by Jack London
6seawolf.txt     "The Sea Wolf", by Jack London
6whtfng.txt      "White Fang", by Jack London
7dmoro.txt       "The Island of Doctor Moreau", by H. G. Wells
7time.txt        "The Time Machine", by H(erbert) G(eorge) Wells [1898]
7warwrld.txt     "The War of the Worlds", by H(erbert) G(eorge) Wells [1898]
8human.txt       "Of Human Bondage", by Somerset Maugham
8moon.txt        "The Moon and Sixpence", by Somerset Maugham
Pattrib.txt      "Attributes of the Mujahideen: Compliance
		 with the Sunnah", by Mufti Khubaib Sahib,
		 http://www.ummah.net.pk/harkat/jihad/attribut.htm
Pchechen.txt     "Chechen-Russo Conflict", by Abdullah
		 Khan, Feb 2000, footnotes removed,
		 http://www.amina.com/article/chechenrus_confl.html
Pcommun.txt      "The Communist Manifesto", by Karl Marx and Friedrich
		 Engels
Pexpel.txt       "Declaration Of War Against The Americans Occupying
		 The Land Of The Two Holy Places", Usama bin Laden, from
		 http://www.azzam.com/html/articlesdeclaration.htm
Pfunda.txt       "Fundamentalism", by Maulana Muhammad Masoud Azhar,
		 http://www.ummah.net.pk/harkat/jihad/fundamen.htm
Pjihad.txt       "Jihad: The forgotten obligation",
		 http://www.ummah.net.pk/harkat/jihad/o-jihad.htm
Pkoran.txt       "The Koran", M.H. Shakir translation
Pmiscon.txt      "7 Misconceptions In Fighting The Apostate Regime",
		 Al-Jama'ah Al-Islamiyyah (Islamic Group) in Egypt, from
		 http://www.azzam.com/html/articlesmisconceptions.htm
Pshamyl.txt      "The Jihad of Imam Shamyl", by Kerim Fenari,
		 http://www.amina.com/article/jihad_imamshamyl.html
Punabomb.txt     "Unabomber's Manifesto", by Ted Kaczynski

Overall Statistics

     54056 unique 1-grams for the corpus
    753606 unique 2-grams for the corpus
   1854335 unique 3-grams for the corpus
   2387282 unique 4-grams for the corpus

Measures for Individual Texts

                File   Sentences       Words   Characters
          0hfinn.txt       14131      112552       565633
          0lmiss.txt       15530      144653       813040
          0tramp.txt       18398      153328       857005
         0yankee.txt       15136      120622       642558
           1emma.txt       16770      158080       887254
         1persua.txt        8468       83309       467136
          1pride.txt       14179      121756       686896
          1sense.txt       14748      118575       672750
          2gmars.txt       10216       82691       452178
          2pmars.txt        7787       65884       363652
         2tarzan.txt       11858       85426       479989
         2timelf.txt        3774       37179       201350
          3alice.txt        3580       26439       147993
         3lglass.txt        4153       29268       167712
          3snark.txt         820        5090        29596
          4agent.txt       10406       91233       521937
          4hdark.txt        3695       38242       211905
          4sshar.txt        1839       16648        89249
          5great.txt       20868      184420       998394
         5oliver.txt       19931      156996       891646
          5pwprs.txt       38464      298162      1739725
        5twocity.txt       16038      135711       759010
          6callw.txt        3266       31811       179206
        6seawolf.txt       12191      106259       578582
         6whtfng.txt        7797       72225       399769
          7dmoro.txt        4888       43420       241523
           7time.txt        3381       32345       181680
        7warwrld.txt        7248       60374       343012
          8human.txt       28585      259330      1410243
           8moon.txt        9418       75036       407226
         Pattrib.txt         116        1170         7463
        Pchechen.txt         529        5577        35624
         Pcommun.txt        1766       11448        75591
          Pexpel.txt        1155       11956        68982
          Pfunda.txt         500        4711        27975
          Pjihad.txt        3637       35149       205581
          Pkoran.txt       24158      162467       888435
         Pmiscon.txt         882       10052        54280
         Pshamyl.txt         536        5207        31330
        Punabomb.txt        3603       34384       220786
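
For each file and each value of N, the listing below gives the total number of N-grams, the number of unique N-grams, the redundancy (the percentage of N-grams that repeat one already counted), and the percentage of the corpus's unique N-grams that appear in the file. The average number of words per sentence follows each group.

0hfinn.txt has: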
  111451 1-grams,    6836 unique ( 93.866% redundant,  12.646% of corpus)
  105161 2-grams,   43791 unique ( 58.358% redundant,   5.811% of corpus)
   98871 3-grams,   79763 unique ( 19.326% redundant,   4.301% of corpus)
   92829 4-grams,   88461 unique (  4.705% redundant,   3.706% of corpus)
                  17 words/sentence, on average
0lmiss.txt has:
  145265 1-grams,   12733 unique ( 91.235% redundant,  23.555% of corpus)
  137786 2-grams,   71812 unique ( 47.881% redundant,   9.529% of corpus)
  130307 3-grams,  113919 unique ( 12.576% redundant,   6.143% of corpus)
  123089 4-grams,  119823 unique (  2.653% redundant,   5.019% of corpus)
                  18 words/sentence, on average
0tramp.txt has:
  154416 1-grams,   13431 unique ( 91.302% redundant,  24.846% of corpus)
  146961 2-grams,   76231 unique ( 48.128% redundant,  10.115% of corpus)
  139506 3-grams,  121791 unique ( 12.698% redundant,   6.568% of corpus)
  132293 4-grams,  128790 unique (  2.648% redundant,   5.395% of corpus)
                  19 words/sentence, on average
0yankee.txt has:
  119092 1-grams,   11101 unique ( 90.679% redundant,  20.536% of corpus)
  112731 2-grams,   58966 unique ( 47.693% redundant,   7.825% of corpus)
  106370 3-grams,   93735 unique ( 11.878% redundant,   5.055% of corpus)
  100245 4-grams,   98020 unique (  2.220% redundant,   4.106% of corpus)
                  18 words/sentence, on average
1emma.txt has:
  159674 1-grams,    7329 unique ( 95.410% redundant,  13.558% of corpus)
  149636 2-grams,   61088 unique ( 59.176% redundant,   8.106% of corpus)
  139598 3-grams,  112788 unique ( 19.205% redundant,   6.082% of corpus)
  129953 4-grams,  124459 unique (  4.228% redundant,   5.213% of corpus)
                  14 words/sentence, on average
1persua.txt has:
   83414 1-grams,    5913 unique ( 92.911% redundant,  10.939% of corpus)
   79618 2-grams,   39297 unique ( 50.643% redundant,   5.215% of corpus)
   75822 3-grams,   65688 unique ( 13.366% redundant,   3.542% of corpus)
   72117 4-grams,   70381 unique (  2.407% redundant,   2.948% of corpus)
                  21 words/sentence, on average
1pride.txt has:
  121296 1-grams,    6442 unique ( 94.689% redundant,  11.917% of corpus)
  114404 2-grams,   50429 unique ( 55.920% redundant,   6.692% of corpus)
  107512 3-grams,   90158 unique ( 16.141% redundant,   4.862% of corpus)
  100836 4-grams,   97638 unique (  3.171% redundant,   4.090% of corpus)
                  16 words/sentence, on average
1sense.txt has:
  119273 1-grams,    6471 unique ( 94.575% redundant,  11.971% of corpus)
  113507 2-grams,   50641 unique ( 55.385% redundant,   6.720% of corpus)
  107741 3-grams,   90794 unique ( 15.729% redundant,   4.896% of corpus)
  102178 4-grams,   99060 unique (  3.052% redundant,   4.149% of corpus)
                  19 words/sentence, on average
2gmars.txt has:
   82726 1-grams,    6970 unique ( 91.575% redundant,  12.894% of corpus)
   78648 2-grams,   39857 unique ( 49.322% redundant,   5.289% of corpus)
   74570 3-grams,   63623 unique ( 14.680% redundant,   3.431% of corpus)
   70643 4-grams,   67822 unique (  3.993% redundant,   2.841% of corpus)
                  19 words/sentence, on average
2pmars.txt has:
   65910 1-grams,    6506 unique ( 90.129% redundant,  12.036% of corpus)
   63536 2-grams,   34779 unique ( 45.261% redundant,   4.615% of corpus)
   61162 3-grams,   53880 unique ( 11.906% redundant,   2.906% of corpus)
   58818 4-grams,   57291 unique (  2.596% redundant,   2.400% of corpus)
                  27 words/sentence, on average
2tarzan.txt has:
   85627 1-grams,    7467 unique ( 91.280% redundant,  13.813% of corpus)
   81127 2-grams,   42378 unique ( 47.763% redundant,   5.623% of corpus)
   76627 3-grams,   66548 unique ( 13.153% redundant,   3.589% of corpus)
   72319 4-grams,   70263 unique (  2.843% redundant,   2.943% of corpus)
                  18 words/sentence, on average
2timelf.txt has:
   37338 1-grams,    4848 unique ( 87.016% redundant,   8.968% of corpus)
   35495 2-grams,   21277 unique ( 40.056% redundant,   2.823% of corpus)
   33652 3-grams,   30494 unique (  9.384% redundant,   1.644% of corpus)
   31864 4-grams,   31252 unique (  1.921% redundant,   1.309% of corpus)
                  19 words/sentence, on average
3alice.txt has:
   26590 1-grams,    2641 unique ( 90.068% redundant,   4.886% of corpus)
   24721 2-grams,   13256 unique ( 46.378% redundant,   1.759% of corpus)
   22852 3-grams,   19515 unique ( 14.603% redundant,   1.052% of corpus)
   21089 4-grams,   20150 unique (  4.453% redundant,   0.844% of corpus)
                  13 words/sentence, on average
3lglass.txt has:
   29454 1-grams,    2823 unique ( 90.416% redundant,   5.222% of corpus)
   27122 2-grams,   14673 unique ( 45.900% redundant,   1.947% of corpus)
   24790 3-grams,   21579 unique ( 12.953% redundant,   1.164% of corpus)
   22563 4-grams,   21751 unique (  3.599% redundant,   0.911% of corpus)
                  12 words/sentence, on average
3snark.txt has:
    5063 1-grams,    1444 unique ( 71.479% redundant,   2.671% of corpus)
    4749 2-grams,    3669 unique ( 22.742% redundant,   0.487% of corpus)
    4435 3-grams,    4139 unique (  6.674% redundant,   0.223% of corpus)
    4128 4-grams,    3965 unique (  3.949% redundant,   0.166% of corpus)
                  16 words/sentence, on average
4agent.txt has:
   90144 1-grams,    9328 unique ( 89.652% redundant,  17.256% of corpus)
   83967 2-grams,   45666 unique ( 45.614% redundant,   6.060% of corpus)
   77790 3-grams,   67928 unique ( 12.678% redundant,   3.663% of corpus)
   71866 4-grams,   69820 unique (  2.847% redundant,   2.925% of corpus)
                  14 words/sentence, on average
4hdark.txt has:
   38445 1-grams,    5531 unique ( 85.613% redundant,  10.232% of corpus)
   35970 2-grams,   22440 unique ( 37.615% redundant,   2.978% of corpus)
   33495 3-grams,   30797 unique (  8.055% redundant,   1.661% of corpus)
   31147 4-grams,   30702 unique (  1.429% redundant,   1.286% of corpus)
                  14 words/sentence, on average
4sshar.txt has:
   16655 1-grams,    2805 unique ( 83.158% redundant,   5.189% of corpus)
   15488 2-grams,   10111 unique ( 34.717% redundant,   1.342% of corpus)
   14321 3-grams,   13263 unique (  7.388% redundant,   0.715% of corpus)
   13211 4-grams,   13008 unique (  1.537% redundant,   0.545% of corpus)
                  13 words/sentence, on average
5great.txt has:
  185167 1-grams,   11334 unique ( 93.879% redundant,  20.967% of corpus)
  174889 2-grams,   74397 unique ( 57.460% redundant,   9.872% of corpus)
  164611 3-grams,  133273 unique ( 19.038% redundant,   7.187% of corpus)
  154963 4-grams,  148124 unique (  4.413% redundant,   6.205% of corpus)
                  17 words/sentence, on average
5oliver.txt has:
  156779 1-grams,   10805 unique ( 93.108% redundant,  19.989% of corpus)
  146469 2-grams,   70083 unique ( 52.152% redundant,   9.300% of corpus)
  136159 3-grams,  116097 unique ( 14.734% redundant,   6.261% of corpus)
  126666 4-grams,  122781 unique (  3.067% redundant,   5.143% of corpus)
                  13 words/sentence, on average
5pwprs.txt has:
  298167 1-grams,   16086 unique ( 94.605% redundant,  29.758% of corpus)
  277637 2-grams,  118656 unique ( 57.262% redundant,  15.745% of corpus)
  257107 3-grams,  207869 unique ( 19.151% redundant,  11.210% of corpus)
  238576 4-grams,  226667 unique (  4.992% redundant,   9.495% of corpus)
                  13 words/sentence, on average
5twocity.txt has:
  135868 1-grams,   10177 unique ( 92.510% redundant,  18.827% of corpus)
  127712 2-grams,   62334 unique ( 51.192% redundant,   8.271% of corpus)
  119556 3-grams,  102571 unique ( 14.207% redundant,   5.531% of corpus)
  111890 4-grams,  108314 unique (  3.196% redundant,   4.537% of corpus)
                  15 words/sentence, on average
6callw.txt has:
   31845 1-grams,    4843 unique ( 84.792% redundant,   8.959% of corpus)
   30146 2-grams,   19062 unique ( 36.768% redundant,   2.529% of corpus)
   28447 3-grams,   26010 unique (  8.567% redundant,   1.403% of corpus)
   26778 4-grams,   26264 unique (  1.919% redundant,   1.100% of corpus)
                  18 words/sentence, on average
6seawolf.txt has:
  105623 1-grams,    9477 unique ( 91.028% redundant,  17.532% of corpus)
   98562 2-grams,   50245 unique ( 49.022% redundant,   6.667% of corpus)
   91501 3-grams,   78736 unique ( 13.951% redundant,   4.246% of corpus)
   84705 4-grams,   82109 unique (  3.065% redundant,   3.439% of corpus)
                  14 words/sentence, on average
6whtfng.txt has:
   71923 1-grams,    6996 unique ( 90.273% redundant,  12.942% of corpus)
   67164 2-grams,   34886 unique ( 48.058% redundant,   4.629% of corpus)
   62405 3-grams,   53318 unique ( 14.561% redundant,   2.875% of corpus)
   57712 4-grams,   55792 unique (  3.327% redundant,   2.337% of corpus)
                  15 words/sentence, on average
7dmoro.txt has:
   43585 1-grams,    5436 unique ( 87.528% redundant,  10.056% of corpus)
   40682 2-grams,   23482 unique ( 42.279% redundant,   3.116% of corpus)
   37779 3-grams,   33791 unique ( 10.556% redundant,   1.822% of corpus)
   34982 4-grams,   34148 unique (  2.384% redundant,   1.430% of corpus)
                  14 words/sentence, on average
7time.txt has:
   32457 1-grams,    4666 unique ( 85.624% redundant,   8.632% of corpus)
   30491 2-grams,   18603 unique ( 38.989% redundant,   2.469% of corpus)
   28525 3-grams,   25862 unique (  9.336% redundant,   1.395% of corpus)
   26586 4-grams,   26065 unique (  1.960% redundant,   1.092% of corpus)
                  16 words/sentence, on average
7warwrld.txt has:
   60568 1-grams,    7254 unique ( 88.023% redundant,  13.419% of corpus)
   57250 2-grams,   32863 unique ( 42.597% redundant,   4.361% of corpus)
   53932 3-grams,   48131 unique ( 10.756% redundant,   2.596% of corpus)
   50736 4-grams,   49642 unique (  2.156% redundant,   2.079% of corpus)
                  17 words/sentence, on average
8human.txt has:
  259146 1-grams,   12549 unique ( 95.158% redundant,  23.215% of corpus)
  241651 2-grams,   88956 unique ( 63.188% redundant,  11.804% of corpus)
  224156 3-grams,  168036 unique ( 25.036% redundant,   9.062% of corpus)
  207214 4-grams,  191862 unique (  7.409% redundant,   8.037% of corpus)
                  14 words/sentence, on average
8moon.txt has:
   74825 1-grams,    7076 unique ( 90.543% redundant,  13.090% of corpus)
   69433 2-grams,   34896 unique ( 49.741% redundant,   4.631% of corpus)
   64041 3-grams,   54685 unique ( 14.609% redundant,   2.949% of corpus)
   58875 4-grams,   56847 unique (  3.445% redundant,   2.381% of corpus)
                  13 words/sentence, on average
Pattrib.txt has:
    1190 1-grams,     405 unique ( 65.966% redundant,   0.749% of corpus)
    1142 2-grams,     796 unique ( 30.298% redundant,   0.106% of corpus)
    1094 3-grams,     894 unique ( 18.282% redundant,   0.048% of corpus)
    1047 4-grams,     907 unique ( 13.372% redundant,   0.038% of corpus)
                  21 words/sentence, on average
Pchechen.txt has:
    5658 1-grams,    1498 unique ( 73.524% redundant,   2.771% of corpus)
    5449 2-grams,    3980 unique ( 26.959% redundant,   0.528% of corpus)
    5240 3-grams,    4903 unique (  6.431% redundant,   0.264% of corpus)
    5032 4-grams,    4953 unique (  1.570% redundant,   0.207% of corpus)
                  26 words/sentence, on average
Pcommun.txt has:
   11426 1-grams,    2235 unique ( 80.439% redundant,   4.135% of corpus)
   10939 2-grams,    7240 unique ( 33.815% redundant,   0.961% of corpus)
   10452 3-grams,    9419 unique (  9.883% redundant,   0.508% of corpus)
    9971 4-grams,    9677 unique (  2.949% redundant,   0.405% of corpus)
                  22 words/sentence, on average
Pexpel.txt has:
   12261 1-grams,    2193 unique ( 82.114% redundant,   4.057% of corpus)
   11685 2-grams,    6915 unique ( 40.822% redundant,   0.918% of corpus)
   11109 3-grams,    8944 unique ( 19.489% redundant,   0.482% of corpus)
   10548 4-grams,    9189 unique ( 12.884% redundant,   0.385% of corpus)
                  20 words/sentence, on average
Pfunda.txt has:
    4800 1-grams,    1155 unique ( 75.938% redundant,   2.137% of corpus)
    4550 2-grams,    3182 unique ( 30.066% redundant,   0.422% of corpus)
    4300 3-grams,    3926 unique (  8.698% redundant,   0.212% of corpus)
    4054 4-grams,    3937 unique (  2.886% redundant,   0.165% of corpus)
                  18 words/sentence, on average
Pjihad.txt has:
   36418 1-grams,    3734 unique ( 89.747% redundant,   6.908% of corpus)
   34318 2-grams,   16551 unique ( 51.772% redundant,   2.196% of corpus)
   32218 3-grams,   24842 unique ( 22.894% redundant,   1.340% of corpus)
   30300 4-grams,   26600 unique ( 12.211% redundant,   1.114% of corpus)
                  16 words/sentence, on average
Pkoran.txt has:
  166010 1-grams,    5327 unique ( 96.791% redundant,   9.855% of corpus)
  156660 2-grams,   39206 unique ( 74.974% redundant,   5.202% of corpus)
  147310 3-grams,   81927 unique ( 44.385% redundant,   4.418% of corpus)
  138588 4-grams,  103629 unique ( 25.225% redundant,   4.341% of corpus)
                  16 words/sentence, on average
Pmiscon.txt has:
    9890 1-grams,    1230 unique ( 87.563% redundant,   2.275% of corpus)
    9460 2-grams,    4621 unique ( 51.152% redundant,   0.613% of corpus)
    9030 3-grams,    6590 unique ( 27.021% redundant,   0.355% of corpus)
    8617 4-grams,    7131 unique ( 17.245% redundant,   0.299% of corpus)
                  21 words/sentence, on average
Pshamyl.txt has:
    5210 1-grams,    1698 unique ( 67.409% redundant,   3.141% of corpus)
    4944 2-grams,    4107 unique ( 16.930% redundant,   0.545% of corpus)
    4678 3-grams,    4602 unique (  1.625% redundant,   0.248% of corpus)
    4413 4-grams,    4407 unique (  0.136% redundant,   0.185% of corpus)
                  19 words/sentence, on average
Punabomb.txt has:
   34507 1-grams,    4140 unique ( 88.002% redundant,   7.659% of corpus)
   32858 2-grams,   18676 unique ( 43.161% redundant,   2.478% of corpus)
   31209 3-grams,   26776 unique ( 14.204% redundant,   1.444% of corpus)
   29599 4-grams,   28017 unique (  5.345% redundant,   1.174% of corpus)
                  17 words/sentence, on average

Inter-Document Distances

Close matches are indicated in red and somewhat close matches in orange, with arbitrary cut-offs for "close" and "somewhat close". Every fourth line is bold, to make the tables a little easier to read. The tables use a smaller font, but you will still need to maximize your browser window to see much of these rather large tables. Note that the tables are symmetric, so only half of each one is calculated and displayed.

 1-gram distances
          hfinn lmiss tramp yanke  emma persu pride sense gmars pmars tarza timel alice lglas snark agent hdark sshar great olive pwprs twoci callw seawo whtfn dmoro  time warwr human  moon attri chech commu expel funda jihad koran misco shamy unabo
   0hfinn  ---                                                                                                                                                                                                                                            0hfinn
   0lmiss 0.083  ---                                                                                                                                                                                                                                      0lmiss
   0tramp 0.086 0.005  ---                                                                                                                                                                                                                                0tramp
  0yankee 0.048 0.019 0.019  ---                                                                                                                                                                                                                          0yankee
    1emma 0.133 0.114 0.109 0.085  ---                                                                                                                                                                                                                    1emma
  1persua 0.135 0.081 0.079 0.069 0.026  ---                                                                                                                                                                                                              1persua
   1pride 0.146 0.105 0.102 0.084 0.018 0.020  ---                                                                                                                                                                                                        1pride
   1sense 0.149 0.111 0.109 0.090 0.021 0.024 0.013  ---                                                                                                                                                                                                  1sense
   2gmars 0.173 0.047 0.047 0.074 0.150 0.117 0.130 0.139  ---                                                                                                                                                                                            2gmars
   2pmars 0.146 0.045 0.046 0.060 0.131 0.101 0.112 0.120 0.012  ---                                                                                                                                                                                      2pmars
  2tarzan 0.170 0.045 0.048 0.079 0.150 0.100 0.125 0.138 0.051 0.061  ---                                                                                                                                                                                2tarzan
  2timelf 0.103 0.038 0.036 0.041 0.120 0.097 0.111 0.118 0.026 0.022 0.070  ---                                                                                                                                                                          2timelf
   3alice 0.125 0.077 0.076 0.084 0.115 0.104 0.114 0.117 0.110 0.111 0.105 0.095  ---                                                                                                                                                                    3alice
  3lglass 0.113 0.085 0.085 0.086 0.112 0.111 0.115 0.118 0.126 0.125 0.122 0.105 0.013  ---                                                                                                                                                              3lglass
   3snark 0.140 0.051 0.049 0.071 0.158 0.127 0.147 0.157 0.091 0.102 0.065 0.091 0.097 0.103  ---                                                                                                                                                        3snark
   4agent 0.197 0.061 0.062 0.095 0.127 0.089 0.104 0.115 0.071 0.081 0.038 0.098 0.119 0.128 0.090  ---                                                                                                                                                  4agent
   4hdark 0.118 0.038 0.038 0.049 0.109 0.089 0.102 0.111 0.041 0.040 0.064 0.038 0.099 0.100 0.083 0.050  ---                                                                                                                                            4hdark
   4sshar 0.129 0.069 0.069 0.072 0.125 0.115 0.119 0.128 0.056 0.051 0.100 0.048 0.122 0.119 0.110 0.087 0.022  ---                                                                                                                                      4sshar
   5great 0.067 0.066 0.066 0.039 0.070 0.075 0.070 0.080 0.090 0.068 0.124 0.051 0.101 0.092 0.115 0.120 0.049 0.046  ---                                                                                                                                5great
  5oliver 0.115 0.030 0.031 0.048 0.109 0.077 0.093 0.105 0.061 0.062 0.032 0.063 0.073 0.079 0.051 0.042 0.050 0.077 0.071  ---                                                                                                                          5oliver
   5pwprs 0.139 0.043 0.045 0.066 0.129 0.099 0.113 0.130 0.071 0.073 0.051 0.078 0.090 0.098 0.071 0.050 0.061 0.089 0.086 0.018  ---                                                                                                                    5pwprs
 5twocity 0.102 0.022 0.023 0.031 0.084 0.057 0.068 0.079 0.052 0.050 0.035 0.050 0.074 0.079 0.053 0.041 0.039 0.063 0.050 0.014 0.030  ---                                                                                                              5twocity
   6callw 0.145 0.058 0.063 0.082 0.193 0.132 0.168 0.179 0.110 0.108 0.040 0.107 0.140 0.156 0.081 0.080 0.101 0.139 0.146 0.053 0.075 0.057  ---                                                                                                        6callw
 6seawolf 0.072 0.032 0.033 0.030 0.105 0.088 0.099 0.108 0.047 0.033 0.068 0.021 0.084 0.088 0.079 0.090 0.033 0.037 0.034 0.048 0.065 0.035 0.083  ---                                                                                                  6seawolf
  6whtfng 0.161 0.069 0.073 0.092 0.180 0.124 0.154 0.167 0.112 0.115 0.036 0.114 0.145 0.158 0.087 0.064 0.094 0.129 0.144 0.054 0.079 0.055 0.022 0.089  ---                                                                                            6whtfng
   7dmoro 0.111 0.041 0.043 0.051 0.151 0.128 0.140 0.151 0.034 0.027 0.074 0.027 0.107 0.113 0.090 0.091 0.029 0.036 0.051 0.060 0.068 0.050 0.102 0.023 0.107  ---                                                                                      7dmoro
    7time 0.117 0.045 0.046 0.051 0.145 0.124 0.139 0.145 0.033 0.024 0.091 0.024 0.107 0.116 0.107 0.105 0.033 0.040 0.055 0.077 0.084 0.061 0.125 0.027 0.133 0.015  ---                                                                                7time
 7warwrld 0.137 0.026 0.028 0.054 0.173 0.127 0.157 0.164 0.027 0.029 0.048 0.033 0.102 0.120 0.079 0.072 0.042 0.069 0.095 0.052 0.060 0.046 0.070 0.041 0.084 0.027 0.027  ---                                                                          7warwrld
   8human 0.130 0.099 0.098 0.093 0.098 0.076 0.088 0.100 0.155 0.148 0.083 0.133 0.128 0.123 0.108 0.082 0.093 0.120 0.100 0.071 0.103 0.068 0.087 0.101 0.067 0.136 0.164 0.148  ---                                                                    8human
    8moon 0.091 0.081 0.077 0.058 0.063 0.067 0.062 0.071 0.109 0.095 0.113 0.077 0.113 0.101 0.114 0.097 0.047 0.053 0.033 0.078 0.104 0.061 0.141 0.054 0.124 0.078 0.090 0.122 0.053  ---                                                              8moon
  Pattrib 0.387 0.205 0.209 0.267 0.346 0.286 0.312 0.319 0.175 0.196 0.182 0.232 0.283 0.314 0.235 0.191 0.238 0.276 0.326 0.213 0.211 0.214 0.228 0.254 0.235 0.232 0.237 0.181 0.332 0.344  ---                                                        Pattrib
 Pchechen 0.294 0.125 0.141 0.193 0.309 0.239 0.276 0.283 0.138 0.152 0.125 0.169 0.205 0.238 0.162 0.166 0.191 0.218 0.261 0.149 0.155 0.154 0.143 0.177 0.158 0.173 0.178 0.119 0.257 0.280 0.193  ---                                                  Pchechen
  Pcommun 0.327 0.125 0.129 0.194 0.291 0.224 0.255 0.260 0.101 0.122 0.113 0.164 0.211 0.244 0.168 0.119 0.163 0.209 0.266 0.148 0.145 0.143 0.161 0.184 0.171 0.158 0.156 0.097 0.279 0.283 0.128 0.125  ---                                            Pcommun
   Pexpel 0.265 0.100 0.108 0.159 0.267 0.206 0.235 0.247 0.091 0.105 0.095 0.135 0.176 0.208 0.138 0.129 0.155 0.191 0.225 0.117 0.120 0.116 0.127 0.144 0.144 0.131 0.138 0.080 0.249 0.256 0.111 0.090 0.062  ---                                      Pexpel
   Pfunda 0.340 0.145 0.146 0.202 0.277 0.219 0.245 0.252 0.125 0.145 0.137 0.181 0.221 0.254 0.179 0.153 0.186 0.227 0.272 0.163 0.167 0.160 0.185 0.202 0.198 0.186 0.184 0.131 0.283 0.284 0.153 0.148 0.088 0.073  ---                                Pfunda
   Pjihad 0.268 0.108 0.115 0.162 0.251 0.204 0.224 0.236 0.110 0.129 0.107 0.150 0.182 0.204 0.126 0.128 0.150 0.181 0.214 0.119 0.127 0.119 0.144 0.153 0.150 0.142 0.156 0.111 0.222 0.225 0.150 0.113 0.113 0.057 0.093  ---                          Pjihad
   Pkoran 0.185 0.146 0.145 0.144 0.202 0.188 0.192 0.207 0.201 0.197 0.181 0.197 0.207 0.209 0.165 0.216 0.216 0.255 0.197 0.153 0.176 0.144 0.183 0.182 0.205 0.206 0.227 0.196 0.210 0.213 0.272 0.267 0.252 0.154 0.209 0.163  ---                    Pkoran
  Pmiscon 0.267 0.127 0.125 0.165 0.242 0.207 0.219 0.232 0.147 0.163 0.130 0.182 0.200 0.218 0.144 0.161 0.190 0.231 0.236 0.134 0.150 0.136 0.159 0.181 0.171 0.181 0.201 0.152 0.226 0.239 0.202 0.179 0.147 0.100 0.102 0.090 0.118  ---              Pmiscon
  Pshamyl 0.250 0.085 0.089 0.146 0.256 0.182 0.220 0.230 0.085 0.101 0.054 0.122 0.158 0.186 0.106 0.092 0.125 0.160 0.210 0.082 0.093 0.094 0.075 0.126 0.086 0.119 0.132 0.068 0.180 0.214 0.166 0.071 0.094 0.075 0.117 0.109 0.236 0.153  ---        Pshamyl
 Punabomb 0.268 0.106 0.102 0.143 0.190 0.154 0.171 0.175 0.124 0.145 0.131 0.160 0.185 0.203 0.129 0.124 0.154 0.192 0.210 0.130 0.141 0.121 0.181 0.181 0.186 0.177 0.176 0.135 0.218 0.209 0.188 0.164 0.111 0.121 0.108 0.115 0.184 0.114 0.146  ---  Punabomb
          hfinn lmiss tramp yanke  emma persu pride sense gmars pmars tarza timel alice lglas snark agent hdark sshar great olive pwprs twoci callw seawo whtfn dmoro  time warwr human  moon attri chech commu expel funda jihad koran misco shamy unabo


 2-gram distances
          hfinn lmiss tramp yanke  emma persu pride sense gmars pmars tarza timel alice lglas snark agent hdark sshar great olive pwprs twoci callw seawo whtfn dmoro  time warwr human  moon attri chech commu expel funda jihad koran misco shamy unabo
   0hfinn  ---                                                                                                                                                                                                                                            0hfinn
   0lmiss 0.283  ---                                                                                                                                                                                                                                      0lmiss
   0tramp 0.293 0.072  ---                                                                                                                                                                                                                                0tramp
  0yankee 0.226 0.125 0.106  ---                                                                                                                                                                                                                          0yankee
    1emma 0.466 0.339 0.310 0.301  ---                                                                                                                                                                                                                    1emma
  1persua 0.466 0.298 0.279 0.288 0.136  ---                                                                                                                                                                                                              1persua
   1pride 0.466 0.312 0.285 0.291 0.098 0.137  ---                                                                                                                                                                                                        1pride
   1sense 0.456 0.316 0.289 0.290 0.113 0.155 0.088  ---                                                                                                                                                                                                  1sense
   2gmars 0.490 0.198 0.188 0.258 0.439 0.390 0.394 0.401  ---                                                                                                                                                                                            2gmars
   2pmars 0.476 0.209 0.198 0.254 0.427 0.386 0.380 0.388 0.084  ---                                                                                                                                                                                      2pmars
  2tarzan 0.475 0.198 0.193 0.267 0.395 0.329 0.346 0.369 0.166 0.201  ---                                                                                                                                                                                2tarzan
  2timelf 0.410 0.195 0.183 0.222 0.410 0.370 0.383 0.388 0.137 0.154 0.213  ---                                                                                                                                                                          2timelf
   3alice 0.496 0.455 0.437 0.422 0.469 0.468 0.470 0.461 0.542 0.550 0.509 0.518  ---                                                                                                                                                                    3alice
  3lglass 0.439 0.423 0.408 0.386 0.452 0.457 0.455 0.447 0.535 0.538 0.502 0.505 0.238  ---                                                                                                                                                              3lglass
   3snark 0.663 0.560 0.549 0.572 0.632 0.624 0.610 0.619 0.627 0.631 0.591 0.628 0.682 0.668  ---                                                                                                                                                        3snark
   4agent 0.438 0.227 0.207 0.264 0.366 0.316 0.325 0.332 0.304 0.318 0.226 0.326 0.483 0.456 0.595  ---                                                                                                                                                  4agent
   4hdark 0.326 0.173 0.161 0.170 0.357 0.324 0.340 0.339 0.274 0.270 0.279 0.245 0.472 0.424 0.592 0.230  ---                                                                                                                                            4hdark
   4sshar 0.381 0.246 0.243 0.242 0.405 0.381 0.385 0.386 0.308 0.303 0.334 0.283 0.528 0.480 0.625 0.287 0.194  ---                                                                                                                                      4sshar
   5great 0.301 0.212 0.202 0.174 0.281 0.291 0.265 0.269 0.293 0.265 0.341 0.247 0.458 0.409 0.592 0.296 0.182 0.210  ---                                                                                                                                5great
  5oliver 0.392 0.214 0.201 0.244 0.322 0.295 0.283 0.299 0.295 0.309 0.221 0.311 0.384 0.380 0.547 0.223 0.252 0.310 0.232  ---                                                                                                                          5oliver
   5pwprs 0.414 0.211 0.198 0.260 0.358 0.329 0.317 0.336 0.283 0.297 0.223 0.310 0.410 0.413 0.558 0.227 0.265 0.322 0.254 0.096  ---                                                                                                                    5pwprs
 5twocity 0.360 0.149 0.130 0.176 0.268 0.239 0.227 0.242 0.239 0.249 0.186 0.256 0.423 0.399 0.529 0.184 0.197 0.259 0.174 0.126 0.139  ---                                                                                                              5twocity
   6callw 0.424 0.232 0.230 0.279 0.460 0.382 0.412 0.424 0.317 0.338 0.216 0.338 0.538 0.514 0.587 0.258 0.276 0.346 0.357 0.260 0.271 0.218  ---                                                                                                        6callw
 6seawolf 0.336 0.163 0.158 0.171 0.354 0.331 0.326 0.326 0.198 0.195 0.238 0.173 0.478 0.445 0.582 0.270 0.177 0.209 0.170 0.256 0.261 0.190 0.268  ---                                                                                                  6seawolf
  6whtfng 0.451 0.271 0.270 0.305 0.459 0.386 0.418 0.433 0.353 0.376 0.242 0.366 0.554 0.541 0.613 0.278 0.303 0.369 0.377 0.293 0.306 0.247 0.188 0.305  ---                                                                                            6whtfng
   7dmoro 0.388 0.180 0.174 0.217 0.448 0.406 0.416 0.414 0.183 0.189 0.231 0.187 0.494 0.479 0.609 0.290 0.204 0.253 0.222 0.284 0.272 0.226 0.304 0.174 0.335  ---                                                                                      7dmoro
    7time 0.391 0.209 0.198 0.220 0.444 0.414 0.425 0.413 0.210 0.202 0.294 0.197 0.494 0.479 0.631 0.335 0.217 0.268 0.228 0.326 0.320 0.268 0.364 0.195 0.394 0.160  ---                                                                                7time
 7warwrld 0.401 0.149 0.144 0.218 0.453 0.385 0.416 0.415 0.161 0.182 0.190 0.182 0.491 0.476 0.607 0.256 0.213 0.277 0.280 0.270 0.249 0.203 0.258 0.192 0.299 0.145 0.173  ---                                                                          7warwrld
   8human 0.362 0.280 0.273 0.265 0.319 0.281 0.303 0.326 0.420 0.426 0.274 0.392 0.470 0.438 0.586 0.255 0.270 0.344 0.287 0.251 0.294 0.225 0.230 0.302 0.237 0.371 0.423 0.366  ---                                                                    8human
    8moon 0.305 0.219 0.202 0.174 0.271 0.275 0.262 0.270 0.329 0.318 0.298 0.286 0.463 0.414 0.590 0.266 0.188 0.244 0.160 0.255 0.290 0.205 0.298 0.190 0.309 0.268 0.288 0.305 0.149  ---                                                              8moon
  Pattrib 0.847 0.692 0.697 0.753 0.819 0.779 0.799 0.805 0.658 0.681 0.668 0.696 0.835 0.844 0.859 0.728 0.763 0.779 0.797 0.750 0.718 0.721 0.732 0.727 0.753 0.698 0.726 0.665 0.805 0.802  ---                                                        Pattrib
 Pchechen 0.707 0.455 0.489 0.565 0.689 0.627 0.651 0.657 0.483 0.502 0.487 0.527 0.723 0.731 0.760 0.555 0.586 0.614 0.639 0.569 0.542 0.529 0.542 0.549 0.570 0.530 0.565 0.477 0.638 0.637 0.750  ---                                                  Pchechen
  Pcommun 0.664 0.356 0.355 0.474 0.629 0.550 0.579 0.586 0.326 0.359 0.338 0.402 0.670 0.677 0.721 0.423 0.488 0.524 0.566 0.461 0.421 0.411 0.450 0.439 0.479 0.388 0.432 0.316 0.596 0.572 0.651 0.501  ---                                            Pcommun
   Pexpel 0.654 0.349 0.357 0.467 0.619 0.557 0.576 0.588 0.317 0.356 0.332 0.396 0.656 0.675 0.714 0.447 0.501 0.527 0.568 0.470 0.423 0.407 0.454 0.430 0.480 0.381 0.429 0.312 0.589 0.568 0.598 0.480 0.308  ---                                      Pexpel
   Pfunda 0.748 0.529 0.518 0.599 0.687 0.646 0.657 0.665 0.534 0.557 0.545 0.582 0.756 0.755 0.784 0.596 0.624 0.658 0.673 0.607 0.591 0.562 0.603 0.598 0.632 0.575 0.602 0.529 0.695 0.672 0.735 0.648 0.538 0.474  ---                                Pfunda
   Pjihad 0.688 0.511 0.503 0.567 0.658 0.634 0.627 0.640 0.546 0.566 0.550 0.585 0.731 0.722 0.749 0.580 0.592 0.632 0.617 0.573 0.566 0.527 0.587 0.576 0.607 0.567 0.593 0.535 0.643 0.621 0.746 0.663 0.588 0.493 0.580  ---                          Pjihad
   Pkoran 0.619 0.483 0.470 0.516 0.588 0.593 0.558 0.579 0.528 0.546 0.522 0.560 0.696 0.683 0.712 0.581 0.579 0.636 0.581 0.544 0.544 0.481 0.564 0.541 0.596 0.549 0.582 0.527 0.595 0.562 0.765 0.694 0.604 0.495 0.625 0.595  ---                    Pkoran
  Pmiscon 0.712 0.501 0.482 0.558 0.643 0.638 0.617 0.633 0.518 0.551 0.526 0.575 0.738 0.733 0.752 0.600 0.623 0.668 0.653 0.573 0.569 0.530 0.601 0.579 0.625 0.566 0.601 0.533 0.663 0.632 0.748 0.668 0.557 0.452 0.512 0.559 0.490  ---              Pmiscon
  Pshamyl 0.644 0.457 0.450 0.524 0.655 0.589 0.605 0.610 0.483 0.497 0.464 0.524 0.699 0.688 0.722 0.493 0.521 0.564 0.583 0.501 0.494 0.464 0.465 0.512 0.513 0.512 0.537 0.459 0.574 0.580 0.793 0.466 0.543 0.539 0.645 0.656 0.680 0.664  ---        Pshamyl
 Punabomb 0.642 0.373 0.360 0.450 0.534 0.510 0.508 0.515 0.406 0.432 0.416 0.457 0.657 0.647 0.687 0.463 0.510 0.552 0.553 0.465 0.453 0.420 0.511 0.475 0.541 0.462 0.488 0.410 0.588 0.547 0.715 0.555 0.399 0.404 0.539 0.564 0.571 0.506 0.594  ---  Punabomb
          hfinn lmiss tramp yanke  emma persu pride sense gmars pmars tarza timel alice lglas snark agent hdark sshar great olive pwprs twoci callw seawo whtfn dmoro  time warwr human  moon attri chech commu expel funda jihad koran misco shamy unabo


 3-gram distances
          hfinn lmiss tramp yanke  emma persu pride sense gmars pmars tarza timel alice lglas snark agent hdark sshar great olive pwprs twoci callw seawo whtfn dmoro  time warwr human  moon attri chech commu expel funda jihad koran misco shamy unabo
   0hfinn  ---                                                                                                                                                                                                                                            0hfinn
   0lmiss 0.714  ---                                                                                                                                                                                                                                      0lmiss
   0tramp 0.736 0.643  ---                                                                                                                                                                                                                                0tramp
  0yankee 0.708 0.678 0.659  ---                                                                                                                                                                                                                          0yankee
    1emma 0.863 0.768 0.725 0.750  ---                                                                                                                                                                                                                    1emma
  1persua 0.882 0.803 0.776 0.792 0.548  ---                                                                                                                                                                                                              1persua
   1pride 0.876 0.790 0.757 0.784 0.502 0.607  ---                                                                                                                                                                                                        1pride
   1sense 0.871 0.795 0.761 0.784 0.524 0.629 0.541  ---                                                                                                                                                                                                  1sense
   2gmars 0.913 0.818 0.796 0.827 0.836 0.866 0.847 0.852  ---                                                                                                                                                                                            2gmars
   2pmars 0.907 0.812 0.794 0.824 0.831 0.866 0.839 0.848 0.691  ---                                                                                                                                                                                      2pmars
  2tarzan 0.895 0.824 0.812 0.837 0.801 0.830 0.817 0.834 0.807 0.815  ---                                                                                                                                                                                2tarzan
  2timelf 0.868 0.809 0.792 0.809 0.845 0.867 0.855 0.862 0.776 0.793 0.833  ---                                                                                                                                                                          2timelf
   3alice 0.876 0.870 0.867 0.858 0.858 0.877 0.871 0.869 0.927 0.925 0.914 0.917  ---                                                                                                                                                                    3alice
  3lglass 0.864 0.861 0.858 0.850 0.859 0.879 0.876 0.875 0.925 0.928 0.908 0.912 0.771  ---                                                                                                                                                              3lglass
   3snark 0.980 0.973 0.970 0.972 0.966 0.972 0.967 0.967 0.981 0.983 0.978 0.983 0.984 0.983  ---                                                                                                                                                        3snark
   4agent 0.864 0.807 0.793 0.809 0.790 0.811 0.813 0.816 0.876 0.885 0.842 0.885 0.891 0.880 0.975  ---                                                                                                                                                  4agent
   4hdark 0.837 0.790 0.785 0.792 0.835 0.860 0.855 0.859 0.858 0.863 0.879 0.856 0.909 0.897 0.979 0.829  ---                                                                                                                                            4hdark
   4sshar 0.885 0.844 0.842 0.841 0.864 0.887 0.884 0.884 0.885 0.894 0.905 0.883 0.928 0.918 0.982 0.856 0.841  ---                                                                                                                                      4sshar
   5great 0.755 0.692 0.681 0.683 0.686 0.744 0.713 0.724 0.770 0.764 0.817 0.770 0.846 0.827 0.966 0.771 0.757 0.787  ---                                                                                                                                5great
  5oliver 0.815 0.762 0.758 0.779 0.737 0.775 0.751 0.762 0.845 0.854 0.800 0.861 0.860 0.846 0.971 0.785 0.844 0.868 0.677  ---                                                                                                                          5oliver
   5pwprs 0.799 0.741 0.735 0.762 0.722 0.765 0.736 0.753 0.840 0.846 0.803 0.853 0.849 0.842 0.970 0.777 0.831 0.866 0.669 0.557  ---                                                                                                                    5pwprs
 5twocity 0.824 0.747 0.735 0.746 0.718 0.761 0.741 0.761 0.833 0.834 0.801 0.842 0.871 0.858 0.969 0.782 0.821 0.858 0.650 0.691 0.686  ---                                                                                                              5twocity
   6callw 0.893 0.850 0.846 0.854 0.888 0.890 0.896 0.898 0.903 0.913 0.857 0.901 0.940 0.931 0.979 0.868 0.895 0.921 0.864 0.865 0.861 0.853  ---                                                                                                        6callw
 6seawolf 0.826 0.744 0.731 0.733 0.771 0.814 0.797 0.800 0.773 0.783 0.816 0.767 0.888 0.867 0.973 0.816 0.792 0.820 0.675 0.790 0.780 0.764 0.831  ---                                                                                                  6seawolf
  6whtfng 0.855 0.796 0.789 0.789 0.830 0.835 0.849 0.850 0.856 0.875 0.793 0.859 0.908 0.898 0.973 0.804 0.839 0.880 0.801 0.807 0.802 0.789 0.752 0.764  ---                                                                                            6whtfng
   7dmoro 0.857 0.811 0.809 0.818 0.869 0.894 0.881 0.880 0.835 0.844 0.869 0.841 0.913 0.906 0.979 0.876 0.830 0.875 0.764 0.849 0.843 0.833 0.907 0.788 0.865  ---                                                                                      7dmoro
    7time 0.884 0.838 0.826 0.832 0.879 0.899 0.891 0.887 0.847 0.849 0.900 0.852 0.927 0.918 0.986 0.898 0.846 0.890 0.783 0.878 0.874 0.862 0.927 0.816 0.892 0.812  ---                                                                                7time
 7warwrld 0.846 0.782 0.778 0.795 0.857 0.872 0.867 0.866 0.822 0.832 0.851 0.828 0.902 0.891 0.978 0.847 0.821 0.874 0.768 0.829 0.817 0.807 0.881 0.783 0.826 0.780 0.806  ---                                                                          7warwrld
   8human 0.757 0.713 0.694 0.707 0.678 0.707 0.713 0.733 0.851 0.854 0.728 0.832 0.834 0.820 0.963 0.693 0.774 0.838 0.664 0.687 0.683 0.692 0.765 0.733 0.652 0.828 0.865 0.800  ---                                                                    8human
    8moon 0.811 0.740 0.719 0.726 0.689 0.766 0.733 0.750 0.806 0.806 0.800 0.812 0.878 0.862 0.974 0.786 0.781 0.827 0.647 0.765 0.760 0.751 0.858 0.720 0.786 0.797 0.821 0.804 0.564  ---                                                              8moon
  Pattrib 0.994 0.992 0.994 0.993 0.993 0.992 0.994 0.995 0.997 0.996 0.998 0.996 0.997 0.995 0.999 0.995 0.994 0.998 0.993 0.996 0.993 0.996 0.996 0.994 0.996 0.995 0.997 0.995 0.996 0.998  ---                                                        Pattrib
 Pchechen 0.978 0.959 0.966 0.970 0.974 0.976 0.976 0.974 0.980 0.970 0.979 0.978 0.983 0.986 0.996 0.979 0.983 0.989 0.975 0.976 0.972 0.974 0.980 0.975 0.973 0.983 0.985 0.976 0.968 0.979 0.999  ---                                                  Pchechen
  Pcommun 0.981 0.955 0.957 0.962 0.968 0.967 0.967 0.966 0.967 0.967 0.973 0.977 0.986 0.983 0.995 0.963 0.977 0.983 0.967 0.964 0.958 0.961 0.977 0.968 0.967 0.978 0.976 0.964 0.974 0.973 0.999 0.987  ---                                            Pcommun
   Pexpel 0.975 0.946 0.945 0.953 0.956 0.963 0.962 0.960 0.952 0.965 0.971 0.976 0.984 0.983 0.996 0.967 0.972 0.982 0.962 0.958 0.958 0.953 0.973 0.963 0.972 0.967 0.973 0.961 0.972 0.968 0.994 0.974 0.976  ---                                      Pexpel
   Pfunda 0.980 0.964 0.963 0.967 0.970 0.975 0.974 0.973 0.978 0.980 0.981 0.981 0.986 0.986 0.996 0.974 0.978 0.988 0.974 0.974 0.972 0.970 0.978 0.974 0.975 0.980 0.984 0.974 0.977 0.979 0.991 0.987 0.986 0.954  ---                                Pfunda
   Pjihad 0.979 0.963 0.959 0.966 0.964 0.972 0.967 0.969 0.970 0.973 0.973 0.978 0.987 0.986 0.996 0.975 0.978 0.985 0.966 0.970 0.970 0.964 0.982 0.971 0.974 0.979 0.982 0.978 0.967 0.968 0.991 0.989 0.991 0.952 0.957  ---                          Pjihad
   Pkoran 0.958 0.922 0.914 0.925 0.910 0.933 0.915 0.927 0.941 0.941 0.943 0.956 0.975 0.973 0.991 0.957 0.958 0.975 0.928 0.934 0.936 0.914 0.962 0.941 0.955 0.959 0.968 0.956 0.941 0.931 0.997 0.989 0.984 0.935 0.965 0.944  ---                    Pkoran
  Pmiscon 0.977 0.952 0.931 0.955 0.948 0.961 0.954 0.957 0.963 0.969 0.972 0.976 0.985 0.987 0.997 0.975 0.978 0.988 0.967 0.966 0.966 0.953 0.982 0.967 0.974 0.978 0.983 0.976 0.974 0.967 0.998 0.993 0.984 0.908 0.921 0.943 0.903  ---              Pmiscon
  Pshamyl 0.980 0.966 0.969 0.971 0.979 0.977 0.977 0.978 0.972 0.975 0.969 0.976 0.987 0.987 0.997 0.976 0.979 0.987 0.975 0.972 0.972 0.970 0.972 0.974 0.966 0.981 0.984 0.975 0.969 0.979 0.998 0.960 0.988 0.988 0.990 0.991 0.994 0.995  ---        Pshamyl
 Punabomb 0.962 0.922 0.913 0.931 0.920 0.936 0.928 0.929 0.953 0.955 0.957 0.960 0.974 0.968 0.993 0.945 0.963 0.976 0.947 0.942 0.940 0.938 0.972 0.949 0.960 0.965 0.966 0.945 0.944 0.945 0.998 0.978 0.948 0.961 0.969 0.977 0.958 0.945 0.989  ---  Punabomb
          hfinn lmiss tramp yanke  emma persu pride sense gmars pmars tarza timel alice lglas snark agent hdark sshar great olive pwprs twoci callw seawo whtfn dmoro  time warwr human  moon attri chech commu expel funda jihad koran misco shamy unabo


 4-gram distances
          hfinn lmiss tramp yanke  emma persu pride sense gmars pmars tarza timel alice lglas snark agent hdark sshar great olive pwprs twoci callw seawo whtfn dmoro  time warwr human  moon attri chech commu expel funda jihad koran misco shamy unabo
   0hfinn  ---                                                                                                                                                                                                                                            0hfinn
   0lmiss 0.957  ---                                                                                                                                                                                                                                      0lmiss
   0tramp 0.964 0.948  ---                                                                                                                                                                                                                                0tramp
  0yankee 0.962 0.959 0.956  ---                                                                                                                                                                                                                          0yankee
    1emma 0.985 0.972 0.964 0.974  ---                                                                                                                                                                                                                    1emma
  1persua 0.990 0.980 0.976 0.982 0.930  ---                                                                                                                                                                                                              1persua
   1pride 0.987 0.975 0.971 0.978 0.914 0.946  ---                                                                                                                                                                                                        1pride
   1sense 0.986 0.974 0.969 0.975 0.917 0.949 0.920  ---                                                                                                                                                                                                  1sense
   2gmars 0.992 0.978 0.975 0.983 0.981 0.988 0.983 0.983  ---                                                                                                                                                                                            2gmars
   2pmars 0.991 0.977 0.975 0.983 0.981 0.989 0.982 0.983 0.943  ---                                                                                                                                                                                      2pmars
  2tarzan 0.989 0.980 0.978 0.984 0.977 0.986 0.980 0.982 0.968 0.969  ---                                                                                                                                                                                2tarzan
  2timelf 0.988 0.979 0.976 0.982 0.985 0.990 0.986 0.985 0.965 0.974 0.977  ---                                                                                                                                                                          2timelf
   3alice 0.984 0.984 0.985 0.985 0.983 0.987 0.983 0.982 0.992 0.993 0.990 0.994  ---                                                                                                                                                                    3alice
  3lglass 0.985 0.983 0.983 0.985 0.983 0.988 0.984 0.985 0.992 0.992 0.989 0.991 0.943  ---                                                                                                                                                              3lglass
   3snark 0.998 0.997 0.995 0.996 0.998 0.998 0.997 0.998 0.998 0.998 0.998 0.999 0.998 0.998  ---                                                                                                                                                        3snark
   4agent 0.981 0.971 0.969 0.978 0.973 0.980 0.977 0.975 0.981 0.983 0.978 0.985 0.985 0.982 0.997  ---                                                                                                                                                  4agent
   4hdark 0.983 0.978 0.977 0.980 0.984 0.989 0.985 0.985 0.985 0.986 0.988 0.987 0.992 0.991 0.998 0.976  ---                                                                                                                                            4hdark
   4sshar 0.989 0.982 0.983 0.986 0.987 0.993 0.989 0.989 0.988 0.990 0.990 0.989 0.992 0.990 0.999 0.979 0.985  ---                                                                                                                                      4sshar
   5great 0.973 0.963 0.961 0.967 0.958 0.973 0.964 0.965 0.976 0.976 0.982 0.978 0.979 0.979 0.998 0.965 0.977 0.977  ---                                                                                                                                5great
  5oliver 0.979 0.973 0.972 0.978 0.966 0.974 0.967 0.967 0.979 0.982 0.977 0.985 0.979 0.976 0.996 0.967 0.984 0.986 0.950  ---                                                                                                                          5oliver
   5pwprs 0.971 0.962 0.961 0.970 0.951 0.963 0.953 0.956 0.971 0.974 0.972 0.979 0.974 0.972 0.994 0.956 0.977 0.980 0.938 0.904  ---                                                                                                                    5pwprs
 5twocity 0.986 0.976 0.975 0.979 0.971 0.981 0.975 0.977 0.985 0.983 0.982 0.988 0.987 0.986 0.997 0.972 0.987 0.987 0.954 0.963 0.954  ---                                                                                                              5twocity
   6callw 0.989 0.981 0.982 0.985 0.991 0.992 0.989 0.987 0.988 0.989 0.986 0.990 0.995 0.994 0.999 0.983 0.990 0.992 0.990 0.987 0.979 0.991  ---                                                                                                        6callw
 6seawolf 0.983 0.969 0.965 0.972 0.974 0.983 0.976 0.974 0.968 0.973 0.977 0.971 0.988 0.984 0.996 0.972 0.978 0.981 0.962 0.973 0.963 0.978 0.978  ---                                                                                                  6seawolf
  6whtfng 0.986 0.973 0.972 0.978 0.982 0.984 0.982 0.979 0.974 0.979 0.973 0.981 0.991 0.987 0.997 0.968 0.980 0.988 0.978 0.977 0.967 0.981 0.964 0.958  ---                                                                                            6whtfng
   7dmoro 0.986 0.982 0.981 0.986 0.986 0.993 0.989 0.988 0.984 0.985 0.986 0.984 0.992 0.990 0.998 0.986 0.984 0.991 0.977 0.985 0.980 0.987 0.994 0.980 0.985  ---                                                                                      7dmoro
    7time 0.990 0.982 0.981 0.985 0.988 0.992 0.988 0.988 0.982 0.985 0.990 0.985 0.993 0.992 0.999 0.988 0.985 0.992 0.978 0.985 0.983 0.988 0.994 0.981 0.989 0.975  ---                                                                                7time
 7warwrld 0.987 0.978 0.975 0.982 0.986 0.990 0.987 0.985 0.977 0.979 0.980 0.980 0.991 0.989 0.999 0.980 0.983 0.989 0.977 0.981 0.975 0.983 0.988 0.973 0.978 0.974 0.974  ---                                                                          7warwrld
   8human 0.961 0.954 0.950 0.957 0.941 0.957 0.952 0.951 0.975 0.977 0.959 0.976 0.966 0.964 0.991 0.937 0.967 0.973 0.943 0.946 0.933 0.958 0.962 0.953 0.934 0.977 0.981 0.969  ---                                                                    8human
    8moon 0.981 0.970 0.967 0.974 0.949 0.975 0.964 0.967 0.976 0.976 0.974 0.980 0.985 0.983 0.997 0.972 0.977 0.984 0.961 0.972 0.966 0.978 0.988 0.968 0.978 0.979 0.978 0.976 0.906  ---                                                              8moon
  Pattrib 1.000 0.999 0.999 0.999 1.000 1.000 0.999 0.999 1.000 0.999 0.999 1.000 1.000 1.000 1.000 0.999 1.000 1.000 1.000 0.999 0.997 1.000 1.000 1.001 1.000 1.000 1.000 1.000 0.997 0.999  ---                                                        Pattrib
 Pchechen 0.998 0.995 0.995 0.996 0.998 0.999 0.997 0.997 0.998 0.996 0.997 0.998 0.999 1.000 1.000 0.997 0.999 0.999 0.999 0.997 0.995 1.000 0.997 0.998 0.996 1.000 0.999 0.998 0.991 0.998 1.000  ---                                                  Pchechen
  Pcommun 0.999 0.994 0.994 0.994 0.997 0.997 0.995 0.995 0.996 0.994 0.996 0.996 1.000 0.999 1.000 0.995 0.997 0.998 0.998 0.995 0.992 0.997 0.998 0.997 0.994 0.998 0.998 0.997 0.994 0.997 1.000 0.998  ---                                            Pcommun
   Pexpel 0.998 0.994 0.994 0.995 0.998 0.998 0.997 0.996 0.990 0.996 0.996 0.999 0.999 0.999 1.000 0.995 0.998 0.998 0.998 0.996 0.993 0.998 0.997 0.998 0.998 0.999 0.997 0.998 0.994 0.998 0.999 0.997 0.996  ---                                      Pexpel
   Pfunda 0.999 0.997 0.996 0.995 0.997 0.998 0.996 0.995 0.999 0.997 0.998 0.998 0.999 0.999 0.999 0.997 0.998 0.999 0.999 0.997 0.994 0.998 0.997 0.998 0.996 0.999 0.999 0.998 0.994 0.998 0.999 0.998 0.998 0.995  ---                                Pfunda
   Pjihad 0.998 0.996 0.995 0.996 0.997 0.999 0.997 0.997 0.996 0.997 0.996 0.997 0.999 0.999 1.000 0.996 0.998 0.998 0.997 0.997 0.995 0.998 0.998 0.998 0.997 0.999 0.998 0.999 0.994 0.997 0.998 0.998 0.999 0.987 0.986  ---                          Pjihad
   Pkoran 0.997 0.992 0.992 0.992 0.992 0.995 0.992 0.993 0.995 0.993 0.994 0.997 0.999 0.999 1.000 0.994 0.996 0.997 0.994 0.992 0.991 0.992 0.998 0.996 0.996 0.998 0.998 0.997 0.990 0.994 1.000 1.000 0.999 0.987 0.994 0.988  ---                    Pkoran
  Pmiscon 0.999 0.996 0.994 0.995 0.997 0.997 0.995 0.996 0.997 0.998 0.997 0.997 0.999 0.999 1.000 0.998 0.998 0.999 0.998 0.997 0.994 0.998 0.998 0.997 0.998 0.999 0.999 0.998 0.994 0.998 1.000 1.000 0.998 0.972 0.990 0.985 0.979  ---              Pmiscon
  Pshamyl 0.999 0.997 0.997 0.997 0.999 0.999 0.997 0.998 0.997 0.997 0.997 0.998 0.999 1.000 1.000 0.998 0.999 0.999 0.999 0.997 0.994 0.998 0.998 0.999 0.997 0.999 0.999 0.998 0.994 0.998 1.000 0.986 0.998 0.999 0.999 0.999 1.000 0.999  ---        Pshamyl
 Punabomb 0.997 0.992 0.992 0.992 0.991 0.995 0.992 0.991 0.996 0.995 0.995 0.996 0.998 0.998 1.000 0.993 0.996 0.999 0.996 0.994 0.991 0.996 0.997 0.996 0.995 0.998 0.997 0.995 0.989 0.993 1.000 0.998 0.993 0.995 0.994 0.998 0.998 0.997 1.000  ---  Punabomb
          hfinn lmiss tramp yanke  emma persu pride sense gmars pmars tarza timel alice lglas snark agent hdark sshar great olive pwprs twoci callw seawo whtfn dmoro  time warwr human  moon attri chech commu expel funda jihad koran misco shamy unabo



Analysis and Interpretation

As expected, works resemble each other less and less as the N-grams get longer: the "distances" increase until each work is essentially isolated as statistically unique.
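The effect of N on the distance can be tried out on a toy example. The following is only a minimal Python sketch of the procedure described under Method, not the Unix pipeline actually used for the tables. The function names, the tokenization rule (lowercase letters and apostrophes) and the sentence-splitting rule (break on ., ! and ?) are assumptions made for illustration. Feature values here are fractions rather than percentages, so the absolute scale need not match the tables above. Only N-grams that occur in at least one of the two texts are visited, since any other N-gram in the corpus would contribute zero to the sum.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Return the list of n-grams in text; n-grams never span sentences."""
    grams = []
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def frequencies(text, n):
    """Map each n-gram to its share of all n-grams in the text."""
    counts = Counter(ngrams(text, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def distance(text_a, text_b, n):
    """Euclidean distance between the two n-gram frequency vectors."""
    fa, fb = frequencies(text_a, n), frequencies(text_b, n)
    return math.sqrt(sum((fa.get(g, 0.0) - fb.get(g, 0.0)) ** 2
                         for g in set(fa) | set(fb)))


a = "My car goes fast.  My dog is brown."
b = "My car goes slow.  My dog is brown."
for n in (1, 2, 3):
    print(n, round(distance(a, b, n), 3))
```

For these two toy texts, which differ in a single word, the computed distance grows as N goes from 1 to 3, because the one differing word affects a larger share of the N-grams at each step.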

This technique does yield relatively small "distances" between works by the same author, even when analyzed at different scales (N-grams of varying length). The file names were chosen so that similar works sort together in the tables. The technique can also yield small distances for works on the same topic -- "War of the Worlds" and Burroughs' two books about Mars are relatively close.

Disappointingly, the political writings did not appear to be particularly close to each other. The Koran was included in the corpus because some political works were expected either to quote from it or to attempt similar phrasing. However, it would be inappropriate to assign much meaning to any statistics involving the political texts, since they are the shortest ones in the corpus.

More importantly, many of the political works are translations rather than the original texts. The technique should be applicable to texts in other languages, but word order is freer in some languages than in English and more rigid in others, which makes it hard to predict how useful N-gram statistical analysis of non-English texts would be. In any case, the corpus should include the works of interest in their original language(s), plus other works in the same language(s).


Detailed N-Gram Lists


Each report lists the 100 most common N-grams and the number of their occurrences. N-grams occurring only once are not reported.
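A report of this kind could be produced along the following lines. This is a hedged Python sketch rather than the Unix pipeline actually used; the function name ngram_report, its parameters, and the tokenization and sentence-splitting rules are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_report(text, n, limit=100):
    """Return up to `limit` (n-gram, count) pairs, most common first,
    leaving out n-grams that occur only once."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        # N-grams are built per sentence, so none spans a sentence boundary.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    repeated = [(g, c) for g, c in counts.most_common() if c > 1]
    return repeated[:limit]


sample = "The cat sat. The cat ran. The dog sat."
for gram, count in ngram_report(sample, 2):
    print(count, gram)
```

In the sample text only the 2-gram "the cat" occurs more than once, so it is the only line reported.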


© Bob Cromwell, April 2004
This page created with /bin/vi