# TEXTCOMBINE-REV, a software for combining text files to obtain high-quality
# pseudo-randomness in practice (replacing an earlier retracted software)

# Contents.
#
# (1) Prologue.
#
# (2) Release and version notes.
#
# (3) Codes of the software.
#
# (4) Software installation and preparation of the text files for the example.
#
# (5) Running the example.
#
# (6) Epilogue.

###############################################################################

# Prologue.

# Shannon, who introduced the concept of entropy of information, did some
# experiments to estimate the entropy of common English texts [1]. A later work
# of Cover and King gave an estimate of 1.34 bits per letter [2]. Assuming the
# correctness of their estimate, this implies that, if the letters are
# appropriately coded into the space of 5 bits, one needs only to combine 4
# text files in order to obtain bit sequences of full entropy, since
# 4*1.34 = 5.36 > 5.
#
# [1] C. E. Shannon, Prediction and Entropy of Printed English, Collected
#     Papers, pp. 194-208, IEEE.
#
# [2] T. M. Cover, R. C. King, A Convergent Gambling Estimate of the Entropy of
#     English, IEEE Trans. Inf. Theory, vol. 24, 1978, pp. 413-421.

# The present software has a predecessor, TEXTCOMBINE-SP, which was soon
# retracted after its first release owing to a bug in the code of a utility
# function that the author coded and employed during the design phase of that
# software to check for the bias of the encoding of the letters into 5-bit
# byte sequences.

# What has been achieved by the present software can be tersely summarized as
# follows, assuming the general case where the text files are sufficiently
# large:
#
# (1) The generated byte sequences pass, via design specifications of the
#     software, Maurer's universal test and the autocorrelation test for all
#     d in the range [1, 16] as well as the ENT test with an entropy value
#     according to it of at least 7.99 bits per byte.
#     The software is in fact coded such that it gives up, reporting failure,
#     after a certain specified maximum amount of processing has been done
#     without finding a solution.
#
# (2) An extensive experiment of the present author done on all different
#     combinations, totalling 3060 in number, of 4 source materials (of size
#     600 KB each) taken from 18 different books of English literature
#     downloaded from Project Gutenberg resulted in the following:
#
#     (a) No case of failure was ever encountered. On the contrary, the above
#         mentioned processing limit, which is in terms of rounds of certain
#         preprocessing of source materials before they are xor-ed together,
#         was by far not approached in the experiment. For details, see
#         the Epilogue.
#
#     (b) The worst case of entropy according to ENT in the experiment was
#         higher than 7.998 bits per byte and the average CPU time was less
#         than 15 sec on the author's PC.

# Interested potential users of this software are sincerely recommended to
# start right away with a test, either directly repeating the example given
# further below or similarly using a different set of 4 text files of their
# own choice. Examining the coding logic underlying the present software,
# which is explained in the Epilogue further below, could for time and
# efficiency reasons be advantageously delayed to a later time point, when
# some number of tests have all been found to deliver satisfactory results,
# similar to what the present author reported above.

# The test statistics of Maurer's universal test and those of the
# autocorrelation test as well as the full report of the ENT test are printed
# out when the software is run in the way of the example given further below.
# See the print-out in the section "Running the example" further below.

# Since the software utilizes only 3 among a plethora of existing statistical
# tests for randomness, one might question the necessity of applying other
# statistical tests.
# It should be noted, however, that the goal of this software is not the
# generation of bit sequences to compete with those from CSPRNGs, which have
# theoretical proofs of their nice qualities. Since the techniques employed
# in the present software are all empirical or heuristic in nature, any
# rigorous proofs of qualities would evidently be futile from the outset.
# Our main goal is rather humble: it consists in providing a practically
# acceptable, under circumstances fairly convenient and even welcome,
# alternative means of obtaining high-quality pseudo-randomness. Analogy: a
# bicycle for a car owner. A secondary, albeit essential, goal is naturally
# to render concrete support to the estimated value of the entropy of English
# as given by Cover and King in their paper.

# Experts in possession of comprehensive and sophisticated test suites for
# randomness are sincerely requested to kindly spend a little time to do a
# couple of tests on the sequences generated by this software and to inform
# the present author via the Internet of their results in case of detection
# of significant deviations from the expected practically satisfactory
# qualities, so that the author may have the chance to make a correspondingly
# improved version of his software available to the general public in the
# foreseeable future. Thanks in advance from the present author who, lacking
# sufficient knowledge and resources, regretfully doesn't yet have such
# advanced and nice tools at his disposal.

################################################################################

# Release and version notes.

# Version 1.1, released on 04.09.2017.

# Version Notes:
#
# Version 1.0: Released on 31.08.2017.
#
# Version 1.1: 04.09.2017: Modification of testcombinationsub(), correcting
#              typing errors.

# Code lines of documents with the same version number are always identical.
# There may be interim modifications of comment lines.
# This software may be freely used:
#
# 1. for all personal purposes unconditionally and
#
# 2. for all other purposes under the condition that its name, version number
#    and authorship are explicitly mentioned and that the author is informed
#    of all eventual code modifications done.

# A list of the present author's software that is currently directly
# maintained by himself is available at http://mok-kong-shen.de. Users are
# advised to download such software from that home page only.

# The author is indebted to TPS and CWL for review and suggestions throughout
# TEXTCOMBINE-REV's development phase. Any remaining deficiencies of the
# software are however the sole responsibility of the author.

# Constructive critiques, comments and suggestions of extensions and
# improvements are sincerely solicited. Address of the author:
#
# Email: mok-kong.shen@t-online.de
#
# Mail:  Mok-Kong Shen, Postfach 340238, Munich 80099, Germany
#
# (Sender data optional, if no reply is required.)

################################################################################

# Codes of the software.

# The functions are listed in a top-down manner in general.

import os

# The main function.
def textcombinerev(textfilenamelist,resultfilename):
    assert len(textfilenamelist)==4
    global gby5
    global glen8
    global testmaurer
    global gby
    global solutionfound,solutionby
    print("Textfilenamelist:")
    print(textfilenamelist)
    print()
    by=bytearray(0)
    writebinaryfile(by,resultfilename)
    solutionfound=False
    gby5=[]
    for i in range(4):
        by5=gettext(textfilenamelist[i])
        gby5.append(by5)
    glen5=min(len(gby5[0]),len(gby5[1]),len(gby5[2]),len(gby5[3]))
    glen5=(glen5//8)*8
    for i in range(4):
        gby5[i]=gby5[i][:glen5]
    glen8=(glen5*5)//8
    testmaurer=glen8 >= 260000
    if testmaurer:
        print("Selection criteria are based on Maurer's universal test, "\
              "autocorrelation test\n"\
              "and ENT test")
    else:
        print("Selection criteria are based on autocorrelation test and ENT test")
    lplan=3
    pgby5=[[0,0,0,0] for i in range(lplan)]
    gby=[0,0,0,0]
    for i in range(4):
        gby[i]=packby5toby8(gby5[i])
    testcombination()
    if not solutionfound:
        suc=False
        for ppn in range(6):
            for plan in range(lplan):
                for i in range(4):
                    if ppn==0:
                        by5=gby5[i][:]
                    else:
                        by5=pgby5[plan][i]
                    by5=preprocess(by5,plan)
                    pgby5[plan][i]=by5
                    gby[i]=packby5toby8(by5)
                testcombination()
                if solutionfound:
                    suc=True
                    break
            if suc:
                break
        if not suc:
            print("No solution could be found with TEXTCOMBINE-REV. Sorry.")
            print("Please inform the author of this software of the text files used")
            return
    print("\nTEXTCOMBINE-REV has successfully generated a byte array satisfying "\
          "the\nselection criteria of length",len(solutionby),
          "with the following test results:")
    evaluatebysuc(solutionby)
    writebinaryfile(solutionby,resultfilename)
    print("\nResulting byte array has been written as",resultfilename+".bin")
    by=bytearray(0)
    writebinaryfile(by,"enttest")
    return

# Subfunction of textcombinerev().
def testcombination():
    global gby
    global solutionfound,solutionby
    by=xoring(gby[0],gby[1])
    for i in range(2,4):
        by=xoring(gby[i],by)
    evaluateby(by)
    if solutionfound:
        solutionby=by
        return
    testcombinationsub(0)
    if solutionfound:
        return
    testcombinationsub(1)
    return

# Subfunction of testcombination().

def testcombinationsub(sssopt):
    global glen8
    global gby
    global solutionfound,solutionby
    mask=0xffffffffff
    by=bytearray(0)
    rot0=0
    rot1=1
    rot2=2
    rot3=3
    for ii in range(0,glen8-4,5):
        ss0=ss1=ss2=ss3=0
        for j in range(5):
            ss0 <<= 8
            ss0 |= gby[0][ii+j]
            ss1 <<= 8
            ss1 |= gby[1][ii+j]
            ss2 <<= 8
            ss2 |= gby[2][ii+j]
            ss3 <<= 8
            ss3 |= gby[3][ii+j]
        ss0=((ss0 << rot0)&mask)|(ss0 >> (40-rot0))
        ss1=((ss1 << rot1)&mask)|(ss1 >> (40-rot1))
        ss2=((ss2 << rot2)&mask)|(ss2 >> (40-rot2))
        ss3=((ss3 << rot3)&mask)|(ss3 >> (40-rot3))
        rot0=(rot0+1)%40
        rot1=(rot1+1)%40
        rot2=(rot2+1)%40
        rot3=(rot3+1)%40
        if sssopt==0:
            sss=ss0^ss1^ss2^ss3
        else:
            sss=(ss0+ss1+ss2+ss3)&mask
        shf=32
        for j in range(5):
            by.append((sss >> shf)&0xff)
            shf-=8
    evaluateby(by)
    if solutionfound:
        solutionby=by
    return

# Read a text file and encode the letters (reduced to lower case) into
# integers of 5 bits according to the natural alphabetical ordering, i.e.
# into [0, 25]. This encoding is highly biased, since the space of 5 bits
# is [0, 31].

def gettext(textfilename):
    fp=open(textfilename+".txt","r")
    textstring=fp.read()
    fp.close()
    byraw=bytearray(textstring,'utf-8')
    lbyraw=len(byraw)
    by5=bytearray(0)
    for i in range(lbyraw):
        g=byraw[i]
        if 65 <= g <= 90:
            u=g-65
        elif 97 <= g <= 122:
            u=g-97
        else:
            continue
        by5.append(u)
    return(by5)

# Xoring of 2 byte arrays.

def xoring(by1,by2):
    lby1=len(by1)
    lby2=len(by2)
    lby=min(lby1,lby2)
    by=bytearray(0)
    for i in range(lby):
        by.append(by1[i]^by2[i])
    return(by)

# Pack a 5-bit byte array into an 8-bit byte array.
def packby5toby8(by5):
    lby5=len(by5)
    by8=bytearray(0)
    count=0
    gg=0
    for i in range(lby5):
        u=by5[i]
        gg <<= 5
        gg |= u
        count+=1
        if count==8:
            shf=32
            for j in range(5):
                v=(gg >> shf)&0xff
                by8.append(v)
                shf-=8
            count=0
            gg=0
    return(by8)

# A preprocessing of a 5-bit byte array that is inspired by Vigenere
# encryption. Note that it employs the non-secret key: 0, 1, 2, ..., 31.

def v(by5):
    pby=bytearray(0)
    p=0
    for i in range(len(by5)):
        pby.append((by5[i]+p)%32)
        p=(p+1)%32
    return(pby)

# A preprocessing of a 5-bit byte array that is inspired by encryption
# employing the sum of all earlier plaintext letters as auto-key, starting
# with the non-secret IV 0.

def m(by5):
    bys5=bytearray(0)
    p=0
    for i in range(len(by5)):
        s=(p+by5[i])%32
        bys5.append(s)
        p=(p+s)%32
    return(bys5)

# A preprocessing of a 5-bit byte array that employs the count of the current
# letter in the past as an addend (mod 32) to transform the current letter.

def u(by5):
    ucount=[0 for i in range(32)]
    byu5=bytearray(0)
    for i in range(len(by5)):
        t=by5[i]
        u=(t+ucount[t])%32
        byu5.append(u)
        ucount[t]=(ucount[t]+1)%32
    return(byu5)

# Perform preprocessing of a 5-bit byte array according to the value of plan,
# which selects among the preprocessing schemes v(), m() and u().

def preprocess(by5,plan):
    if plan==0:
        by5=v(by5)
    elif plan==1:
        by5=m(by5)
    elif plan==2:
        by5=u(by5)
    else:
        print("plan error")
        exit(111)
    return(by5)

# Evaluate an 8-bit byte array to see whether it satisfies Maurer's universal
# test (if applicable because the array is large enough) and the
# autocorrelation test for all d in [1, 16] and has an entropy value of at
# least 7.99 bits per byte according to the ENT test. If yes, solutionfound
# will be set to True.
def evaluateby(by):
    global solutionfound
    global testmaurer
    suc=1
    while True:
        if testmaurer:
            tu=round(testbyarray(by),4)
            if tu < 7.1781 or tu > 7.1892:
                suc=0
                break
        for d in range(1,17):
            ac=round(autocorrelation(by,d),4)
            if abs(ac) > 1.96:
                suc=0
                break
        if suc==0:
            break
        writebinaryfile(by,"enttest")
        enttext=os.popen("ent enttest.bin").read()
        if enttext[10:14]!="7.99":
            suc=0
        break
    solutionfound=(suc==1)
    return

# After a solution is found, print its test results.

def evaluatebysuc(by):
    global testmaurer
    if testmaurer:
        print("\nStatistic of Maurer's universal test:\n")
        tu=round(testbyarray(by),4)
        print("{0:12.4f}".format(tu))
    print("\nStatistics of autocorrelation test for d in [1, 16]:\n")
    for d in range(1,17):
        ac=round(autocorrelation(by,d),4)
        print("{0:2d}{1:10.4f}".format(d,ac))
    writebinaryfile(by,"enttest")
    enttext=os.popen("ent enttest.bin").read()
    print()
    print("Report from ENT:")
    print()
    print(enttext)
    return

# A set of functions for performing Maurer's universal test:

import math

# Maurer's universal test, see [3].
#
# [3] J-S. Coron, D. Naccache, An Accurate Evaluation of Maurer's Universal
#     Test. http://www.jscoron.fr/publications/universal.pdf

def maurertest(bb):
    qq=2560
    qqp1=qq+1
    kk=256000
    qqkkp1=qq+kk+1
    eftu=7.1836656
    # y1 and y2 are for rho=0.01 and rho=0.001 respectively.
    y1=2.5758
    y2=3.2905
    t=[0 for i in range(256)]
    for i in range(1,qqp1,1):
        t[bb[i]]=i
    sum=0.0
    for i in range(qqp1,qqkkp1,1):
        sum+=math.log10(i-t[bb[i]])
        t[bb[i]]=i
    tu=(sum/kk)/math.log10(2.0)
    c=math.sqrt(0.3732189+(0.3730195*256)/kk)
    sigma=c*math.sqrt(3.2386622/kk)
    t11=eftu-y1*sigma
    t12=eftu+y1*sigma
    t21=eftu-y2*sigma
    t22=eftu+y2*sigma
    return(tu,t11,t12,t21,t22)

def maurertestresult(h,gg):
    global chnum
    qq=2560
    qqp1=qq+1
    kk=256000
    qqkkp1=qq+kk+1
    eftu=7.1836656
    if h*chnum < qqkkp1:
        print("Error in maurertestresult")
        exit(6)
    bb=[0 for k in range(h*chnum)]
    u=0
    k1=-1
    k2=chnum-1
    for i in range(h):
        g=gg[i]
        for k in range(k2,k1,-1):
            bb[u]=g&0xff
            g >>= 8
            u+=1
        k1+=chnum
        k2+=chnum
    tu,t11,t12,t21,t22 = maurertest(bb)
    return(round(tu,4))

# Apply Maurer's universal test to byarray. The minimum length of byarray is
# 260000. The result should lie in [7.1781, 7.1892].

def testbyarray(byarray):
    global chnum
    # We actually test only the first 258560 bytes of byarray, see maurertest().
    h=260000
    chnum=1
    assert len(byarray) >= h
    tu=maurertestresult(h,byarray[:h])
    return(tu)

# Autocorrelation test of an 8-bit byte array for d in [1, 16].

def autocorrelation(by,d):
    assert 1 <= d <= 16
    lby=len(by)
    # Number of 1's in a byte for byte values 0 - 255.
    numb1inbyte=bytearray(0)
    for k in range(256):
        s=0
        m=1
        for i in range(8):
            s+=(k&m) >> i
            m <<= 1
        numb1inbyte.append(s)
    u=0xffff0000
    masks=[u >> i for i in range(17)]
    # A(d)
    ad=0
    for i in range(0,lby-3,2):
        h1=(by[i] << 8|by[i+1]) << 16
        g=h1|by[i+2] << 8|by[i+3]
        h2=(g&masks[d]) << d
        hh=(h1^h2) >> 16
        ad+=numb1inbyte[hh >> 8]+numb1inbyte[hh&0xff]
    n=(lby-3)*8+d
    x5=2*(ad-(n-d)/2)/math.sqrt(n-d)
    return(x5)

# Functions for input/output of text/binary files.
def writebinaryfile(byarray,binaryfilename):
    fp=open(binaryfilename+".bin","wb")
    fp.write(byarray)
    fp.close()

def readbinaryfile(binaryfilename):
    fp=open(binaryfilename+".bin","rb")
    byarray=bytearray(fp.read())
    fp.close()
    return(byarray)

def writetextfile(textstring,textfilename):
    fp=open(textfilename+".txt","w")
    fp.write(textstring)
    fp.close()
    return

def readtextfile(textfilename):
    fp=open(textfilename+".txt","r")
    textstring=fp.read()
    fp.close()
    return(textstring)

# A utility function to skip head and tail parts of a given text file,
# optionally perform a rotation and shorten the result to a new length.
# The new file is stored under a name which is the original file name
# extended with "sh".

def shortentextfile(textfilename,headskip,tailskip,rotatepercent,newlength):
    assert headskip >= 0 and tailskip >= 0 and 0 <= rotatepercent < 100 and\
           newlength > 0
    textstring=readtextfile(textfilename)
    if tailskip > 0:
        textstring=textstring[headskip:][:-tailskip]
    else:
        textstring=textstring[headskip:]
    if rotatepercent > 0:
        k=(len(textstring)*rotatepercent)//100
        if k > 0:
            textstring=textstring[k:]+textstring[:k]
    assert len(textstring) >= newlength
    writetextfile(textstring[:newlength],textfilename+"sh")
    return

################################################################################

# Software installation and preparation of the text files for the example.

# Python version 3.x can be downloaded from http://python.org. The present
# code can be stored in a file named e.g. textcombinerev.py and run in
# Python's GUI IDLE. (File --> Open to find and open the file, then in the
# window showing the code Run --> Run Module to run it. The file will be
# accessible in later runs under Recent Files.)

# ENT can be downloaded from http://www.fourmilab.ch/random/. The file ent.exe
# has to be made accessible from the directory in which the present code is
# located, e.g. via having it in the same directory.
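# As a small, self-contained illustration (a sketch only; the sample string
# and the helper name encode5 are the author's hypothetical choices, not part
# of the software proper), the following reproduces the letter encoding that
# gettext() applies to the downloaded texts: A-Z/a-z are mapped to [0, 25]
# and all other characters are discarded.

```python
def encode5(text):
    # Map A-Z and a-z to 0..25 (case folded) and skip every other
    # character, exactly as gettext() does after reading a file.
    out = bytearray()
    for g in bytearray(text, "utf-8"):
        if 65 <= g <= 90:        # 'A'..'Z'
            out.append(g - 65)
        elif 97 <= g <= 122:     # 'a'..'z'
            out.append(g - 97)
    return out

print(list(encode5("Ab, z!")))   # -> [0, 1, 25]
```

# Note that punctuation, digits and spaces contribute nothing to the 5-bit
# streams; only the letters of the books carry entropy into the combination.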
# 4 text files are to be downloaded and stored in the directory where the
# present code is stored. As example the author chooses to employ the
# following books from http://www.gutenberg.org and store them as austen.txt,
# conrad.txt, collins.txt and hardy.txt respectively:
#
# Jane Austen, Pride and Prejudice, 709 KB.
#
# Joseph Conrad, Lord Jim, 734 KB.
#
# Wilkie Collins, The Fallen Leaves, 764 KB.
#
# Thomas Hardy, The Woodlanders, 926 KB.

################################################################################

# Running the example.

# After the present code is run, shorten the 4 text files by typing in the
# main GUI window the following commands, which will result in 4 new files
# austensh.txt, conradsh.txt, collinssh.txt and hardysh.txt:
#
# shortentextfile("austen",2000,8000,0,600000)
# shortentextfile("conrad",2000,8000,0,600000)
# shortentextfile("collins",2000,8000,0,600000)
# shortentextfile("hardy",2000,8000,0,600000)

# Now type in the following commands to combine these 4 files:
#
# textfilenamelist=["austensh","conradsh","collinssh","hardysh"]
# textcombinerev(textfilenamelist,"myresult")

# A print-out will come out as follows:

# Textfilenamelist:
# ['austensh', 'conradsh', 'collinssh', 'hardysh']
#
# Selection criteria are based on Maurer's universal test, autocorrelation test
# and ENT test
#
# TEXTCOMBINE-REV has successfully generated a byte array satisfying the
# selection criteria of length 289615 with the following test results:
#
# Statistic of Maurer's universal test:
#
#       7.1849
#
# Statistics of autocorrelation test for d in [1, 16]:
#
#  1   -1.5202
#  2   -0.4862
#  3   -0.0473
#  4   -0.0447
#  5    0.1537
#  6   -1.3823
#  7   -0.8291
#  8    1.5215
#  9    0.9513
# 10   -0.5295
# 11    1.6490
# 12   -0.2010
# 13   -0.1577
# 14   -0.0696
# 15   -0.2129
# 16   -0.1931
#
# Report from ENT:
#
# Entropy = 7.999334 bits per byte.
#
# Optimum compression would reduce the size
# of this 289615 byte file by 0 percent.
# Chi square distribution for 289615 samples is 267.68, and randomly
# would exceed this value 28.03 percent of the times.
#
# Arithmetic mean value of data bytes is 127.5758 (127.5 = random).
# Monte Carlo value for Pi is 3.132942468 (error 0.28 percent).
# Serial correlation coefficient is 0.001267 (totally uncorrelated = 0.0).
#
# Resulting byte array has been written as myresult.bin

################################################################################

# Epilogue.

# After the episode mentioned in the Prologue, the author received from a
# number of persons kind encouragements to attempt to find a good replacement
# for the failed software. In this context he takes the liberty to mention
# the name of Cecilia Tanaka who, having read on the Internet of the author's
# failure, cared to give encouragements even at a time of her own extremely
# severe illness.

# Initially the author's work concentrated on finding an optimal encoding of
# a-z into the 5-bit space [0, 31] based on the frequency distributions of
# the letters in the individual text files. However, due to the high
# non-linearity of the target function for optimization, a global optimum
# could never be successfully obtained within the realm of his humble
# knowledge and resources.

# The author's attention then switched to the possible benefits of certain
# promising "encryption-like" transformations of the naturally/naively
# encoded, hence highly biased, 5-bit sequences before they are xor-ed
# together. It turned out that these can indeed be fairly beneficial, albeit
# only to some limited extent, if they are applied singly. The present
# software finally chooses to jointly make use of three different such
# transformations which in an extensive experiment had shown their
# superiority among a total of some 20 similar transformations.

# Hereafter the said genre of transformations will be denoted as
# preprocessing.
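# The simplest of the three preprocessing schemes adopted, v() in the code
# above, can be illustrated in isolation as follows (a self-contained sketch;
# the toy input is hypothetical and far too short for any statistical test).

```python
def v(by5):
    # Vigenere-like preprocessing of a 5-bit byte array with the
    # non-secret running key 0, 1, 2, ..., 31 (mod 32).
    pby = bytearray()
    p = 0
    for x in by5:
        pby.append((x + p) % 32)
        p = (p + 1) % 32
    return pby

print(list(v(bytearray([5, 5, 5, 5]))))   # -> [5, 6, 7, 8]
```

# The constant run of 5's, a maximally biased input, is spread over distinct
# values, which is the limited but useful bias reduction discussed above.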
# Note that they can only be "encryption-like" and not genuine encryptions,
# because the latter employ secret keys which inherently have entropy in
# them, and such "additional" entropy would betray our very goal of verifying
# that the text files are, via their entropy contents alone, able to produce
# bit sequences of full entropy. An example of such preprocessing for a 5-bit
# byte sequence is processing with a 32*32 Vigenere table, employing the
# "non-secret" key 0, 1, 2, ..., 31.

# Furthermore, it was found that the preprocessing could be advantageously
# employed in iteration, i.e. being repeatedly applied.

# Another experimental observation that turned out to be very helpful in
# achieving our goal was that, independent of preprocessing, xor-ing the 4,
# though highly biased, 5-bit sequences from the given text files already has
# a significantly good chance of roughly 2/3 of yielding sequences of
# practically full entropy. In other words, we consequently only need to
# attempt to deal successfully with the remaining 1/3 of the cases.

# A technical sketch of our scheme in the general case, i.e. where the text
# files are sufficiently large so that Maurer's universal test can be
# applied, is as follows:

# The main function textcombinerev() obtains from the given 4 text files
# 4 byte arrays containing the 5-bit encoding of the letters in them (all
# reduced to lower case and converted to the range [0, 25] in the natural
# alphabetical ordering), shortens them, if necessary, to a common length
# and decides whether a certain round of preprocessing should be done on
# them. It then packs these into 8-bit byte arrays and passes them to the
# function testcombination(), which first tries to combine the 4 streams as
# such with xor-ing and sees whether the result passes Maurer's universal
# test, the autocorrelation test and the ENT test. If yes, success is
# already achieved.
# If not, it calls the function testcombinationsub(), where attempts (in 2
# cases) are made to obtain from the 4 byte arrays successively 4 blocks of
# 40 bits each, with certain bit rotations of the blocks relative to one
# another (the amounts of rotation are dynamically increased from block to
# block as the processing advances), and to combine these first (the 1st
# case) with xor-ing to see whether the tests mentioned are satisfied. If
# yes, success is already achieved. If not (the 2nd case), the blocks are
# summed mod 2**40 to see whether the tests are satisfied. If yes, success
# is already achieved. If not, failure is reported back upstream to
# textcombinerev(), where an additional round of preprocessing will be
# applied and testcombination() once again invoked.

# Since we employ 3 different preprocessing schemes, these are tried one
# after the other in an iteration round, and up to 6 iterations of
# preprocessing will be done before textcombinerev() reports failure to
# perform its work.

# From an experiment with a large set of combinations of 4 text files
# (already mentioned in the Prologue), it was found that the number of
# iterations of preprocessing needed for success seems to follow a sharply
# exponentially decreasing pattern. In fact no case in the said experiment
# needed more than 2 iterations. It follows that limiting the number of
# iterations of preprocessing to a maximum of 6 is a good enough strategy.

# In cases where the sizes of the text files are not sufficiently large for
# Maurer's universal test to be applied, we are forced to forgo that test,
# i.e. only the autocorrelation test and the ENT test will be the selection
# criteria employed.

# It may be noted that text files in utf-8 format obtained from Project
# Gutenberg may contain in the head and tail parts certain texts, e.g.
# license notes, that do not properly belong to the books.
# One may desire to exclude such foreign materials or prefer to use only
# certain selected parts of the books. The utility function shortentextfile()
# may be used for such purposes.

# A small number of the downloaded files from Project Gutenberg may contain
# coding errors such that Python may report errors while attempting to read
# them. The remedy for such an eventuality is clear: one chooses to download
# a different book.
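# The core combining step sketched in the Epilogue, xor-ing 4 (biased) 5-bit
# sequences and packing the result into 8-bit bytes as textcombinerev() does
# for real texts, can be demonstrated end to end on toy data (the streams
# below are hypothetical and, of course, far too small for any of the
# statistical tests; pack5to8 re-implements the logic of packby5toby8()).

```python
def pack5to8(by5):
    # Pack each group of eight 5-bit values into five 8-bit bytes, as
    # packby5toby8() does; a trailing incomplete group is dropped.
    by8 = bytearray()
    gg = count = 0
    for x in by5:
        gg = (gg << 5) | x
        count += 1
        if count == 8:
            for shf in (32, 24, 16, 8, 0):
                by8.append((gg >> shf) & 0xff)
            gg = count = 0
    return by8

# Four toy 5-bit streams of a common length of 8 values each.
streams = [bytearray([(3 * i + k) % 26 for i in range(8)]) for k in range(4)]
# Xor the 4 streams position by position, then pack into 8-bit bytes.
combined = bytearray(a ^ b ^ c ^ d for a, b, c, d in zip(*streams))
print(list(pack5to8(combined)))   # -> [1, 0, 64, 112, 12]
```

# On real 600 KB texts the same two steps yield the hundreds of kilobytes of
# candidate bytes that are then screened by the three selection tests.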