TEXTCOMBINE-REV





# TEXTCOMBINE-REV, a program for combining text files to obtain high-quality
# pseudo-randomness in practice (replacing an earlier retracted program)


# Contents.
#
# (1) Prologue.
#
# (2) Release and version notes.
#
# (3) Codes of the software.
#
# (4) Software installation and preparation of the text files for the example.
#
# (5) Running the example.
#
# (6) Epilogue.



###############################################################################



# Prologue.

# Shannon, who introduced the concept of the entropy of information, performed
# experiments to estimate the entropy of common English texts [1]. A later work
# of Cover and King gave an estimate of 1.34 bits per letter [2]. Assuming the
# correctness of their estimate, this implies that, if the letters are
# appropriately coded into a space of 5 bits, one only needs to combine 4
# text files in order to obtain bit sequences of full entropy, since
# 4*1.34 = 5.36 > 5.
#
# [1] C. E. Shannon, Prediction and Entropy of Printed English, Collected
#     Papers, pp. 194-208, IEEE.
#
# [2] T. M. Cover, R. C. King, A Convergent Gambling Estimate of the Entropy of
#     English, IEEE Trans. Inf. Theory, vol. 24, 1978, pp. 413-421.
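
# Assuming nothing beyond the Cover-King figure, the little calculation above
# can be checked directly (a minimal sketch; the names below are chosen here
# for illustration only):

```python
import math

# Estimated entropy of English per letter (Cover and King) and the width
# in bits of the naive letter encoding employed further below.
entropy_per_letter = 1.34
bits_per_symbol = 5

# Smallest number of text files whose combined entropy per letter first
# exceeds the 5-bit symbol width.
needed = math.ceil(bits_per_symbol / entropy_per_letter)
print(needed)  # 4
print(needed * entropy_per_letter)  # exceeds 5
```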

# The present software has a predecessor, TEXTCOMBINE-SP, which was soon
# retracted after its first release owing to a bug in the code of a utility 
# function that the author coded and employed during the design phase of that
# software to check for the bias of the encoding of the letters into 5-bit
# byte sequences.

# What has been achieved by the present software can be tersely summarized as 
# follows, assuming the general case where the text files are sufficiently
# large:
#
# (1) The generated byte sequences pass, by design specification of the
#     software, Maurer's universal test and the autocorrelation test for all
#     d in the range [1, 16], as well as the ENT test with an entropy value
#     according to it of at least 7.99 bits per byte. The software is coded
#     such that it gives up, reporting failure, after a certain specified
#     maximum amount of processing has been done without finding a solution.
#
# (2) An extensive experiment of the present author, done on all different
#     combinations, 3060 in total, of 4 source materials (of size 600 KB
#     each) taken from 18 different books of English literature downloaded
#     from Project Gutenberg, resulted in the following:
#
#     (a) No case of failure was ever encountered. On the contrary, the above
#         mentioned processing limit, which is in terms of rounds of certain
#         preprocessing of the source materials before they are xor-ed
#         together, was never even approached in the experiment. For details,
#         see the Epilogue.
#
#     (b) The worst entropy value according to ENT in the experiment was
#         higher than 7.998 bits per byte, and the average CPU time was less
#         than 15 sec on the author's PC.
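
# As a small plausibility check of the figure 3060 mentioned above: it is
# simply the number of ways of choosing 4 books out of 18 (a minimal sketch):

```python
import math

# Number of distinct 4-element subsets of 18 books.
print(math.comb(18, 4))  # 3060
```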

# Interested potential users of this software are sincerely recommended to
# start right away with a test, either directly repeating the example given
# further below or similarly using a different set of 4 text files of their
# own choice. Examining the coding logic underlying the present software,
# which is explained in the Epilogue further below, could for time and
# efficiency reasons be advantageously postponed until some number of tests
# have all been found to deliver satisfactory results, similar to what the
# present author reported above.

# The test statistics of Maurer's universal test and those of the
# autocorrelation test as well as the full report of the ENT test are printed
# out when the software is run in the way of the example given further below.
# See the print-out in the section "Running the example" further below.

# Since the software utilizes only 3 among a plethora of existing statistical
# tests for randomness, one might question the necessity of applying other
# statistical tests. It should be noted however that the goal of this software
# is not the generation of bit sequences to compete with those from the
# CSPRNGs, which have theoretical proofs of their nice qualities. Since the
# techniques employed in the present software are all empirical or heuristic in
# nature, any rigorous proofs of qualities would evidently be futile from the
# outset. On the contrary, our main goal is rather humble: It consists in
# providing a practically acceptable, under circumstances fairly convenient and
# even welcome alternative means of obtaining high-quality pseudo-randomness.
# Analogy: a bicycle for a car owner. A secondary, albeit essential, goal is
# naturally to render concrete support to the estimated value of entropy of
# English as given by Cover and King in their paper.

# Experts in possession of comprehensive and sophisticated test suites for
# randomness are sincerely requested to kindly spend a little time doing a
# couple of tests on the sequences generated by this software, and to inform
# the present author via the Internet of their results in case significant
# deviations from the expected practically satisfactory qualities are
# detected, so that the author may have the chance to make a correspondingly
# improved version of his software available to the general public in the
# foreseeable future. Thanks in advance from the present author who, lacking
# sufficient knowledge and resources, regretfully doesn't yet have such
# advanced and nice tools at his disposal.



################################################################################



# Release and version notes.


# Version 1.1, released on 04.09.2017.

# Version Notes:

# Version 1.0: Released on 31.08.2017.

# Version 1.1: 04.09.2017: Modification of testcombinationsub(), correcting
# typing errors.


# Code lines of documents with the same version number are always identical.
# There may be interim modifications of comment lines. 


# This software may be freely used:

# 1. for all personal purposes unconditionally and

# 2. for all other purposes under the condition that its name, version number
#    and authorship are explicitly mentioned and that the author is informed
#    of any code modifications made.


# A list of the present author's software currently maintained directly by
# himself is available at http://mok-kong-shen.de. Users are advised to
# download such software from that home page only.


# The author is indebted to TPS and CWL for review and suggestions throughout
# TEXTCOMBINE-REV's development phase. Any remaining deficiencies of the
# software are however the sole responsibility of the author.


# Constructive critiques, comments and suggestions of extensions and
# improvements are sincerely solicited. Address of the author:
#
# Email: mok-kong.shen@t-online.de
#
# Mail: Mok-Kong Shen, Postfach 340238, Munich 80099, Germany
#
# (Sender data optional, if no reply is required.)



################################################################################



# Codes of the software.


# The functions are listed in a top-down manner in general.


import os


# The main function.

def textcombinerev(textfilenamelist,resultfilename):
  assert len(textfilenamelist)==4
  global gby5
  global glen8
  global testmaurer
  global gby
  global solutionfound,solutionby
  print("Textfilenamelist:")
  print(textfilenamelist)
  print()
  by=bytearray(0)
  writebinaryfile(by,resultfilename)
  solutionfound=False
  gby5=[]  
  for i in range(4):
    by5=gettext(textfilenamelist[i])
    gby5.append(by5)
  glen5=min(len(gby5[0]),len(gby5[1]),len(gby5[2]),len(gby5[3]))
  glen5=(glen5//8)*8
  for i in range(4):
    gby5[i]=gby5[i][:glen5]
  glen8=(glen5*5)//8
  testmaurer=glen8 >= 260000
  if testmaurer:
    print("Selection criteria are based on Maurer's universal test, "\
          "autocorrelation test \n"\
          "and ENT test")
  else:
    print("Selection criteria are based on autocorrelation test and ENT test")
  lplan=3
  pgby5=[[0,0,0,0] for i in range(lplan)]
  gby=[0,0,0,0]
  for i in range(4):
    gby[i]=packby5toby8(gby5[i])
  testcombination()
  if not solutionfound:
    suc=False
    for ppn in range (6):
      for plan in range(lplan):
        for i in range(4):
          if ppn==0:
            by5=gby5[i][:]
          else:
            by5=pgby5[plan][i]
          by5=preprocess(by5,plan)
          pgby5[plan][i]=by5
          gby[i]=packby5toby8(by5)
        testcombination()
        if solutionfound:       
          suc=True
          break
      if suc:
        break
    if not suc:
      print("No solution could be found with TEXTCOMBINE-REV. Sorry.")
      print("Please inform the author of this software of the textfiles used")
      return
  print("\nTEXTCOMBINE-REV has successfully generated a byte array satisfying "\
        "the\nselection criteria of length",len(solutionby),
        "with the following test results:")
  evaluatebysuc(solutionby)
  writebinaryfile(solutionby,resultfilename)
  print("\nResulting byte array has been written as",resultfilename+".bin")
  by=bytearray(0)
  writebinaryfile(by,"enttest")
  return


# Subfunction of textcombinerev().

def testcombination(): 
  global gby
  global solutionfound,solutionby
  by=xoring(gby[0],gby[1])
  for i in range(2,4):
    by=xoring(gby[i],by)
  evaluateby(by)  
  if solutionfound:
    solutionby=by
    return
  testcombinationsub(0)
  if solutionfound:
    return
  testcombinationsub(1)
  return


# Subfunction of testcombination().

def testcombinationsub(sssopt):
  global glen8 
  global gby
  global solutionfound,solutionby
  mask=0xffffffffff  
  by=bytearray(0)  
  rot0=0
  rot1=1
  rot2=2
  rot3=3
  for ii in range(0,glen8-4,5):
    ss0=ss1=ss2=ss3=0
    for j in range(5):
      ss0 <<= 8
      ss0|=gby[0][ii+j]
      ss1 <<= 8
      ss1|=gby[1][ii+j]
      ss2 <<= 8
      ss2|=gby[2][ii+j]
      ss3 <<= 8
      ss3|=gby[3][ii+j]
    ss0=((ss0 << rot0)&mask)|(ss0 >> (40-rot0))
    ss1=((ss1 << rot1)&mask)|(ss1 >> (40-rot1))
    ss2=((ss2 << rot2)&mask)|(ss2 >> (40-rot2))
    ss3=((ss3 << rot3)&mask)|(ss3 >> (40-rot3))
    rot0=(rot0+1)%40
    rot1=(rot1+1)%40
    rot2=(rot2+1)%40
    rot3=(rot3+1)%40
    if sssopt==0:
      sss=ss0^ss1^ss2^ss3
    else:
      sss=(ss0+ss1+ss2+ss3)&mask
    shf=32
    for j in range(5):
      by.append((sss >> shf)&0xff)
      shf-=8
  evaluateby(by)
  if solutionfound:
    solutionby=by
  return


# Read a text file and encode the letters (reduced to lower case) into 5-bit
# integers according to the natural alphabetical ordering, i.e. into [0, 25].
# This encoding is highly biased, since the space of 5 bits is [0, 31].

def gettext(textfilename):
  fp=open(textfilename+".txt","r",encoding="utf-8")
  textstring=fp.read()
  fp.close()
  byraw=bytearray(textstring,'utf-8')
  lbyraw=len(byraw)
  by5=bytearray(0)
  for i in range(lbyraw):
    g=byraw[i]
    if 65 <= g <= 90:
      u=g-65   
    elif 97 <= g <= 122:
      u=g-97
    else:
      continue
    by5.append(u)
  return(by5)
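
# A side remark on the bias just mentioned: even if all 26 letters occurred
# equally often, a symbol so obtained would carry at most log2(26) bits,
# already below the full 5 bits (a minimal sketch):

```python
import math

# Entropy upper bound for a 5-bit symbol restricted to the values [0, 25]:
# attained by the uniform distribution over the 26 letters.
print(math.log2(26))  # about 4.70 < 5
```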


# xoring of 2 byte arrays.

def xoring(by1,by2):
  lby1=len(by1)
  lby2=len(by2)
  lby=min(lby1,lby2)
  by=bytearray(0)
  for i in range(lby):
    by.append(by1[i]^by2[i])
  return(by)


# Pack a 5-bit byte array into an 8-bit byte array.

def packby5toby8(by5):
  lby5=len(by5)
  by8=bytearray(0)
  count=0
  gg=0
  for i in range(lby5):
    u=by5[i]    
    gg <<= 5
    gg|=u
    count+=1
    if count==8:
      shf=32
      for j in range(5):
        v=(gg >> shf)&0xff
        by8.append(v)
        shf-=8
      count=0
      gg=0
  return(by8)


# A preprocessing on a 5-bit byte array that is inspired by Vigenere
# encryption. Note that it employs the non-secret key: 0, 1, 2, ..... 31.

def v(by5):
  pby=bytearray(0)
  p=0
  for i in range(len(by5)):
    pby.append((by5[i]+p)%32)
    p=(p+1)%32
  return(pby)


# A preprocessing on a 5-bit byte array that is inspired by encryption employing
# the sum of all earlier plaintext letters as auto-key, starting with the
# non-secret IV 0.

def m(by5):
  bys5=bytearray(0)
  p=0
  for i in range(len(by5)):
    s=(p+by5[i])%32
    bys5.append(s)
    p=(p+s)%32
  return(bys5)


# A preprocessing on a 5-bit byte array that employs the count of the current
# letter in the past as an addend (mod 32) to transform the current letter.

def u(by5):
  ucount=[0 for i in range(32)]
  byu5=bytearray(0)
  for i in range(len(by5)):
    t=by5[i]
    u=(t+ucount[t])%32
    byu5.append(u)
    ucount[t]=(ucount[t]+1)%32
  return(byu5)


# Perform preprocessing of a 5-bit byte array according to the value of plan,
# which selects among the preprocessing schemes v(), m() and u().

def preprocess(by5,plan):
  if plan==0:
    by5=v(by5)
  elif plan==1:
    by5=m(by5)
  elif plan==2:
    by5=u(by5)
  else:
    print("plan error")
    exit(111)
  return(by5)


# Evaluate an 8-bit byte array to see whether it satisfies Maurer's universal
# test (if applicable because the array is large enough) and the autocorrelation
# test for all d in [1, 16] and has an entropy value of at least 7.99 bits per
# byte according to ENT test. If yes, solutionfound will be set to True.

def evaluateby(by):
  global solutionfound
  global testmaurer
  suc=1
  while True:
    if testmaurer:
      tu=round(testbyarray(by),4)
      if tu < 7.1781 or tu > 7.1892:
        suc=0
        break
    for d in range(1,17):
      ac=round(autocorrelation(by,d),4)
      if abs(ac) > 1.96:
        suc=0
        break
    if suc==0:
      break
    writebinaryfile(by,"enttest")
    enttext=os.popen("ent enttest.bin").read()
    if enttext[10:14]!="7.99":
      suc=0
    break
  solutionfound=(suc==1)
  return


# After a solution is found, print its test results.

def evaluatebysuc(by):
  global testmaurer
  if testmaurer:  
    print("\nStatistic of Maurer's universal test:\n")
    tu=round(testbyarray(by),4)
    print("{0:12.4f}".format(tu))
  print("\nStatistics of autocorrelation test for d in [1, 16]:\n")
  for d in range(1,17):
    ac=round(autocorrelation(by,d),4)
    print("{0:2d}{1:10.4f}".format(d,ac))
  writebinaryfile(by,"enttest")
  enttext=os.popen("ent enttest.bin").read()
  print()
  print("Report from ENT:")
  print()
  print(enttext)
  return


# A set of functions for performing Maurer's universal test:

import math


# Maurer's Universal Test, see [3].
# [3] J-S. Coron, D. Naccache, An Accurate Evaluation of Maurer's Universal Test.
#     http://www.jscoron.fr/publications/universal.pdf


def maurertest(bb):
  qq=2560
  qqp1=qq+1
  kk=256000
  qqkkp1=qq+kk+1
  eftu=7.1836656
# y1 and y2 are for rho=0.01 and rho=0.001 respectively.
  y1=2.5758
  y2=3.2905
  t=[0 for i in range(256)]
  for i in range(1,qqp1,1):
    t[bb[i]]=i
  sum=0.0
  for i in range(qqp1,qqkkp1,1):
    sum+=math.log10(i-t[bb[i]])
    t[bb[i]]=i
  tu=(sum/kk)/math.log10(2.0)
  c=math.sqrt(0.3732189+(0.3730195*256)/kk)
  sigma=c*math.sqrt(3.2386622/kk)
  t11=eftu-y1*sigma
  t12=eftu+y1*sigma
  t21=eftu-y2*sigma
  t22=eftu+y2*sigma
  return(tu,t11,t12,t21,t22)
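
# The acceptance interval [7.1781, 7.1892] used in evaluateby() can be
# recovered, up to rounding, from the constants above (a minimal sketch for
# rho=0.01, i.e. the y1 bounds):

```python
import math

# Constants as in maurertest() for L=8, Q=2560, K=256000.
kk=256000
eftu=7.1836656
y1=2.5758
c=math.sqrt(0.3732189+(0.3730195*256)/kk)
sigma=c*math.sqrt(3.2386622/kk)
# The two bounds come out close to 7.1781 and 7.1892.
print(eftu-y1*sigma,eftu+y1*sigma)
```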


def maurertestresult(h,gg):
  global chnum
  qq=2560
  qqp1=qq+1
  kk=256000
  qqkkp1=qq+kk+1
  eftu=7.1836656
  if h*chnum < qqkkp1:
    print("Error in maurertestresult")
    exit(6)
  bb=[0 for k in range(h*chnum)]
  u=0
  k1=-1
  k2=chnum-1
  for i in range(h):
    g=gg[i]
    for k in range(k2,k1,-1):
      bb[u]=g&0xff
      g >>= 8
      u+=1
    k1+=chnum
    k2+=chnum
  tu,t11,t12,t21,t22 = maurertest(bb)
  return(round(tu,4))


# Apply Maurer's universal test to byarray. Min length of byarray is 260000.
# The result should lie in [7.1781, 7.1892].

def testbyarray(byarray):
  global chnum
# We actually test only the first 258560 bytes of byarray, see maurertest().
  h=260000
  chnum=1
  assert len(byarray) >= h
  tu=maurertestresult(h,byarray[:h])
  return(tu)


# Autocorrelation test of an 8-bit byte array for d in [1, 16].

def autocorrelation(by,d):
  assert 1 <= d <= 16
  lby=len(by)
# number of 1's in a byte for byte values 0 - 255
  numb1inbyte=bytearray(0)
  for k in range(256):
    s=0
    m=1
    for i in range(8):
      s+=(k&m) >> i
      m <<= 1
    numb1inbyte.append(s)
  u=0xffff0000
  masks=[u >> i for i in range(17)]
# A(d)
  ad=0
  for i in range(0,lby-3,2):
    h1=(by[i] << 8|by[i+1]) << 16
    g=h1|by[i+2] << 8|by[i+3]
    h2=(g&masks[d]) << d
    hh=(h1^h2) >> 16
    ad+=numb1inbyte[hh >> 8]+numb1inbyte[hh&0xff]
  n=(lby-3)*8+d
  x5=2*(ad-(n-d)/2)/math.sqrt(n-d)
  return(x5)


# Functions for input/output of text/binary files.


def writebinaryfile(byarray,binaryfilename):
  fp=open(binaryfilename+".bin","wb")
  fp.write(byarray)
  fp.close()


def readbinaryfile(binaryfilename):
  fp=open(binaryfilename+".bin","rb")
  byarray=bytearray(fp.read())
  fp.close()
  return(byarray)


def writetextfile(textstring,textfilename):
  fp=open(textfilename+".txt","w")
  fp.write(textstring)
  fp.close()
  return


def readtextfile(textfilename):
  fp=open(textfilename+".txt","r",encoding="utf-8")
  textstring=fp.read()
  fp.close()
  return(textstring)


# A utility function to skip head and tail parts of a given text file,
# optionally perform a rotation and shorten the result to a new length.
# The new file is stored with a name which is the original file name extended
# with "sh".

def shortentextfile(textfilename,headskip,tailskip,rotatepercent,newlength):
  assert headskip >= 0 and tailskip >= 0 and 0 <= rotatepercent < 100 and\
         newlength > 0
  textstring=readtextfile(textfilename)
  if tailskip > 0:
    textstring=textstring[headskip:][:-tailskip]
  else:
    textstring=textstring[headskip:]
  if rotatepercent > 0:
    k=(len(textstring)*rotatepercent)//100
    if k > 0:
      textstring=textstring[k:]+textstring[:k]
  assert len(textstring) >= newlength
  writetextfile(textstring[:newlength],textfilename+"sh")
  return



################################################################################



# Software installation and preparation of the text files for the example.


# Python version 3.x can be downloaded from http://python.org. The present code
# can be stored in a file named e.g. textcombinerev.py and run in Python's
# GUI IDLE. (File --> Open to find and open the file, then in the window
# showing the code Run --> Run Module to run it. The file will be accessible
# in later runs under Recent Files.)

# ENT can be downloaded from http://www.fourmilab.ch/random/. The file ent.exe
# has to be made accessible from the directory in which the present code is
# located, e.g. by having it in the same directory.

# 4 text files are to be downloaded and stored in the directory where the
# present code is stored. As example the author chooses to employ the
# following books from http://www.gutenberg.org and store them as austen.txt,
# conrad.txt, collins.txt and hardy.txt respectively.
#
# Jane Austen, Pride and Prejudice, 709 KB.
#
# Joseph Conrad, Lord Jim, 734 KB.
#
# Wilkie Collins, The Fallen Leaves, 764 KB.
#
# Thomas Hardy, The Woodlanders, 926 KB.



################################################################################



# Running the example.


# After the present code is run, shorten the 4 text files by typing in the main
# GUI window the following commands, which will result in 4 new files
# austensh.txt, conradsh.txt, collinssh.txt and hardysh.txt:
#
# shortentextfile("austen",2000,8000,0,600000)
# shortentextfile("conrad",2000,8000,0,600000)
# shortentextfile("collins",2000,8000,0,600000)
# shortentextfile("hardy",2000,8000,0,600000)

# Now type in the following commands to combine these 4 files:
#
# textfilenamelist=["austensh","conradsh","collinssh","hardysh"]
# textcombinerev(textfilenamelist,"myresult")


# A print-out will come out as follows:

# Textfilenamelist:
# ['austensh', 'conradsh', 'collinssh', 'hardysh']

# Selection criteria are based on Maurer's universal test, autocorrelation test 
# and ENT test

# TEXTCOMBINE-REV has successfully generated a byte array satisfying the
# selection criteria of length 289615 with the following test results:

# Statistic of Maurer's universal test:

#       7.1849

# Statistics of autocorrelation test for d in [1, 16]:

#  1   -1.5202
#  2   -0.4862
#  3   -0.0473
#  4   -0.0447
#  5    0.1537
#  6   -1.3823
#  7   -0.8291
#  8    1.5215
#  9    0.9513
# 10   -0.5295
# 11    1.6490
# 12   -0.2010
# 13   -0.1577
# 14   -0.0696
# 15   -0.2129
# 16   -0.1931

# Report from ENT:

# Entropy = 7.999334 bits per byte.

# Optimum compression would reduce the size
# of this 289615 byte file by 0 percent.

# Chi square distribution for 289615 samples is 267.68, and randomly
# would exceed this value 28.03 percent of the times.

# Arithmetic mean value of data bytes is 127.5758 (127.5 = random).
# Monte Carlo value for Pi is 3.132942468 (error 0.28 percent).
# Serial correlation coefficient is 0.001267 (totally uncorrelated = 0.0).


# Resulting byte array has been written as myresult.bin



################################################################################



# Epilogue.


# After the episode mentioned in the Prologue, the author received from a
# number of persons kind encouragement to attempt to find a good replacement
# for the failed software. In this context he takes the liberty to mention
# herewith the name of Cecilia Tanaka who, having read on the Internet of the
# author's failure, cared to give encouragement even at a time of her own
# extremely severe illness.

# Initially the author's work concentrated on finding an optimal encoding of
# a-z to the 5-bit space [0, 31] based on the frequency distributions of the
# letters in the individual text files. However, due to the high non-linearity
# of the target function for optimization, a global optimum could never be
# successfully obtained within the realm of his humble knowledge and resources.

# The author's attention then switched to the possible benefits of certain
# promising "encryption-like" transformations of the naturally/naively encoded,
# hence highly biased, 5-bit sequences before they are xor-ed together. It
# turned out that these can indeed be fairly beneficial, albeit only to a
# limited extent, if applied singly. The present software finally chooses to
# jointly make use of three different such transformations, which in an
# extensive experiment had shown their superiority among a total of some
# 20 similar transformations.

# Hereafter the said genre of transformations will be denoted as preprocessing.
# Note that they can only be "encryption-like" and not genuine encryptions,
# because the latter employ secret keys which inherently have entropy in them,
# and such "additional" entropy would defeat our very goal of verifying that
# the text files are, via their entropy contents alone, able to produce bit
# sequences of full entropy. An example of such preprocessing for a 5-bit
# byte sequence is processing with a 32*32 Vigenere table, employing the
# "non-secret" key 0, 1, 2, ..... 31.
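
# The Vigenere-like preprocessing just described corresponds to the function
# v() in the code above; a tiny self-contained illustration (the name below
# is chosen here for illustration only) of how it breaks up constant runs:

```python
def vigenere_demo(by5):
  # Add the running non-secret key 0, 1, 2, ..... 31 (mod 32), as in v().
  pby=bytearray(0)
  p=0
  for g in by5:
    pby.append((g+p)%32)
    p=(p+1)%32
  return(pby)

print(list(vigenere_demo(bytearray([7,7,7,7,7]))))  # [7, 8, 9, 10, 11]
```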

# Furthermore, it was found that the preprocessing could be advantageously
# employed in iteration, i.e. being repeatedly applied.

# Another experimental observation that turned out to be very helpful in
# achieving our goal was that, independent of preprocessing, xor-ing the 4,
# though highly biased, 5-bit sequences from the given text files already has
# a fairly good chance, of roughly 2/3, of yielding sequences of practically
# full entropy. In other words, we consequently only need to attempt to
# somehow successfully deal with the remaining 1/3 of the cases.

# A technical sketch of our scheme in the general case, i.e. where the text
# files are sufficiently large so that Maurer's universal test can be applied,
# is as follows:

# The main function textcombinerev() obtains from the given 4 text files
# 4 byte arrays containing the 5-bit encoding of the letters in them (all
# reduced to lower case and converted to the range [0, 25] in the natural
# alphabetical ordering), shortens them, if necessary, to a common length
# and decides whether a certain round of preprocessing should be done on them.
# It then packs these into 8-bit byte arrays and passes them to the function
# testcombination(), which first tries to combine the 4 streams as such with
# xor-ing and sees whether the result passes Maurer's universal test, the
# autocorrelation test and the ENT test. If yes, success is already achieved.
# If not, it calls the function testcombinationsub(), where attempts (in 2
# cases) are made to obtain from the 4 byte arrays successively 4 blocks of
# 40 bits each, with certain bit rotations of the blocks relative to one
# another (the amounts of rotation are dynamically increased from block to
# block as the processing advances), and to combine these first (the 1st case)
# with xor-ing to see whether the tests mentioned are satisfied. If yes,
# success is already achieved. If not (the 2nd case), the blocks are summed
# mod 2**40 to see whether the tests are satisfied. If yes, success is already
# achieved. If not, failure is reported back upstream to textcombinerev(),
# where an additional round of preprocessing will be applied and
# testcombination() once again invoked.

# Since we employ 3 different preprocessing schemes, these are tried one after
# the other in an iteration round and up to 6 iterations of preprocessing will
# be done before textcombinerev() reports failure to perform its work.

# From an experiment with a large set of combinations of 4 text files (already
# mentioned in the Prologue), it was found that the number of iterations of
# preprocessing needed for success seems to follow a sharply (roughly
# exponentially) decreasing pattern. In fact no case in the said experiment
# needed more than 2 iterations. It follows that limiting the number of
# iterations of preprocessing to a maximum of 6 is a good enough strategy.

# In cases where the sizes of the text files are not sufficiently large for
# Maurer's universal test to be applied, we are forced to forgo that test, i.e.
# only the autocorrelation test and the ENT test will be the selection criteria
# employed.

# It may be noted that text files in utf-8 format obtained from Project
# Gutenberg may contain in the head and tail parts certain texts, e.g.
# license notes, that do not properly belong to the books. One may desire to
# exclude such foreign materials or prefer to use only certain selected parts
# of the books. The utility function shortentextfile() may be used for such
# purposes.

# A small number of the files downloaded from Project Gutenberg may contain
# coding errors such that Python reports errors while attempting to read
# them. A simple remedy for such an eventuality is to choose and download
# a different book.


