EMAILSTEGANO





# EMAILSTEGANO, a scheme for embedding stego bits in emails and similar text
# files.

# Since emails are exchanged frequently, they provide a naturally available
# busy channel for transmitting stego bits. The embedding rate of EMAILSTEGANO
# is only about 1 bit per line of the resulting stego text (covertext). The
# stego text differs from the original message text only slightly, in
# formatting (i.e. both are identical word for word), which is its principal
# advantage over the other known text stego schemes, whether of syntactic or
# semantic nature. In practice the low bit rate is in most cases more or less
# compensated by the fact that the accumulated volume of the cover texts, and
# hence the number of stego bits transmitted in a given time period, can
# nonetheless be substantial.

# Webpages of communication partners can serve a similar function to emails in
# transmitting stego bits, for the texts in HTML source files (accessible via
# the right mouse key) are fairly free from formatting constraints, even
# though the webpages themselves are always nicely formatted, and thus can
# be processed by EMAILSTEGANO. In this case, however, the recipient needs to
# know of the availability of new stego information, either from the new
# contents of the webpages or from notices via an independent channel. Note
# that there are many kinds of webpages whose contents are by nature subject to
# very frequent updates; such webpages are very suitable for our use.

# The user has to define maxlinelen, which limits the width of the stegotext
# output. 72 seems to be a good value in practice. For example, with the email
# software Thunderbird, a text file with line lengths <= 72, when cut and
# pasted into the input window, remains unchanged in its 'visual' appearance
# for the sender. (Setting maxlinelen=120 has, however, the benefit of
# producing fewer irregularities in the line lengths, if the recipient reads
# the emails in a maximized email window.)

# Assumption: all words of the input file are shorter than maxlinelen/2.
 
# A word is understood as any sequence of characters bounded by spaces or eol,
# i.e. "\n" in the sense of C and Python. (On typing into an editor on Windows,
# eol is generated when the return key is pressed.) The paragraphs of the given
# raw input text file have to be separated from one another by two or more eol
# (and additionally any number of spaces). There is otherwise no limitation by
# EMAILSTEGANO on the format of that file; in particular its lines need not be
# left adjusted and their lengths can be completely arbitrary.
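
# The paragraph rule above can be illustrated with a short sketch. It is not
# the code used by EMAILSTEGANO itself (see getparagraphs() further below);
# the name split_paragraphs and the regular expression are assumptions made
# for illustration only.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs, each returned as a list of words.

    Paragraphs are separated by two or more eol; the lines in between may
    additionally contain spaces.
    """
    parts = re.split(r"\n(?:[ ]*\n)+", text)
    paragraphs = []
    for part in parts:
        words = part.split()
        if words:  # skip empty pieces, e.g. from leading or trailing blank lines
            paragraphs.append(words)
    return paragraphs

sample_text = "It was  the best\nof times,\n\n   \n  it was the worst\nof times.\n"
sample_paragraphs = split_paragraphs(sample_text)
```

# Here sample_paragraphs holds two word lists of six words each; neither the
# irregular spacing nor the line breaks inside a paragraph matter.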

# The number of words in a line of the output file stegotext mod 2 gives the
# stego bit embedded in the line. Note that, as a special convention required 
# by programming logic, for each paragraph of the stegotext output by
# EMAILSTEGANO, the last line contains no embedded stego bit. In particular, any
# single-line paragraph has no embedded stego bit. All lines of stegotext are
# left adjusted. If desired, the sender may re-format the file stegotext by 
# adding a few spaces at the beginning of some of its lines. This doesn't affect
# the stegobits recovered by the recipient (nothing should be done, however, if 
# on cutting and pasting into the email window this re-formatting 'appears' to 
# lead to additional eols).
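
# A minimal sketch of the parity rule just described, for a single paragraph.
# The name parity_bits is hypothetical; the recipient's function actually
# provided by this file is recoverstegobits() further below.

```python
def parity_bits(paragraph):
    """Return the bits carried by one paragraph as a string of 0/1.

    Each line except the last one carries (number of words) mod 2.
    """
    lines = paragraph.strip("\n").split("\n")
    # Spaces added by the sender at the beginning of a line are harmless,
    # since split() ignores them when counting words.
    return "".join(str(len(line.split()) % 2) for line in lines[:-1])

sample_paragraph = "one two three\n  four five\nsix"
sample_bits = parity_bits(sample_paragraph)
```

# In the sample the first line has 3 words (bit 1), the second has 2 words
# (bit 0) and the last line carries no bit, so sample_bits is "10".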

# The stegobits to be embedded are to be specified by the user as a string of
# 0/1 in the variable named stegobits. If there is more input text than needed
# to process the bits in stegobits, random dummy bits will be used to process
# the rest of the input text. Python's built-in PRNG, which supplies the random
# bits, starts from a variable seed if the user doesn't set a seed. Thus if the
# processing is repeated (i.e. with the same raw input text and the same
# stegobits) the dummy stego bits used may differ, which means that the part of
# the stegofile that corresponds to the dummy bits may not be formatted exactly
# the same each time.
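
# A small sketch of the effect of the seed, assuming the user wants the dummy
# bits to be reproducible. The helper dummy_bits is hypothetical and only
# mimics how the dummy bits are drawn; with the code below one would instead
# call random.seed() with some fixed integer before embedstegobits().

```python
import random

def dummy_bits(n, seed=None):
    """Draw n dummy bits; seed=None lets Python choose a variable seed."""
    rng = random.Random(seed)
    return "".join("1" if rng.randint(0, 1) else "0" for _ in range(n))

# The same seed gives the same dummy bits on every run, hence the same
# formatting of the corresponding part of the stegofile.
first_run = dummy_bits(16, seed=2012)
second_run = dummy_bits(16, seed=2012)
```

# Without a seed the two runs would in general differ.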

# Since there are in general dummy bits at the end of the processed covertext in
# the stegofile, the recipient needs to know the length of the given (real)
# stego bits. Length of stegobits could be a constant (or any multiple of it)
# agreed upon by the communication partners or else one could arrange to have
# an eof symbol (e.g. for 5 bit encoding of the alphabet one 5 bit code could be
# chosen to serve as eof, a bof symbol may be similarly employed). Stegobits may
# have at the beginning an agreed upon number of dummy bits to be ignored by the
# recipient. Another means of indicating the length of stegobits (including
# the presence/absence of stego bits at all in a given email) is to utilize
# certain keywords, e.g. personal names, city names, etc. The presence or not
# of a word from one keyword list may indicate the presence/absence of stego
# bits. The words of another keyword list may be mapped to [0,9] (there may be
# homophones, i.e. the mapping can be many to one) such that one could e.g. use
# two of them to indicate a number in [0,99] that corresponds to the length of
# the given stego bits, i.e. when W1 and W2 of the keyword list occur in the
# message and either no other words of the keyword list occur or only the
# first two occurrences of words of the keyword list are counted. Note that
# these keywords could, by agreement, be located in the header or footer of the
# email that contains the covertext of the stegofile instead of being words in
# the covertext itself.
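
# The eof variant mentioned above can be sketched as follows. The concrete
# code table (A..Z mapped to 0..25, the value 31 as eof) and the function
# names are assumptions for illustration; the communication partners may
# agree on any other table.

```python
EOF_CODE = 31  # assumption: any 5-bit value not assigned to a letter would do

def to_stegobits(message):
    """Encode the letters A-Z of message as 5-bit codes and append eof."""
    codes = [ord(c) - ord("A") for c in message.upper() if "A" <= c <= "Z"]
    return "".join(format(code, "05b") for code in codes + [EOF_CODE])

def from_stegobits(bits):
    """Decode 5-bit groups up to the eof code; later (dummy) bits are ignored."""
    letters = []
    for k in range(0, len(bits) - 4, 5):
        code = int(bits[k:k + 5], 2)
        if code == EOF_CODE:
            break
        letters.append(chr(code + ord("A")))
    return "".join(letters)

real_bits = to_stegobits("HI")
# The recipient also sees the dummy bits that follow the real ones.
decoded = from_stegobits(real_bits + "0110")
```

# Here real_bits is "001110100011111" (H=7, I=8, then eof) and decoded is
# "HI" regardless of the dummy bits appended after the eof code.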

# The sender can employ the function recoverstegobits() to verify that the stego
# text generated is ok. Note that computer processing is only a desirable
# convenience, in particular the recipient could without too much difficulty
# manually retrieve the stego bits, if required. 

# It is self-evident that for security the stegobits should stem from a good
# encryption processing. (At the risk of being blamed for self-promotion, we
# mention here the author's SHUFFLE2.)

# We assume only passive wardens, i.e. the stegotext is not modified en route to
# the recipient. Since email writing is commonly done very casually with
# respect to formatting, the possible slight unnaturalness in the appearance of
# the resulting stegotext is deemed acceptable and in any case apparently
# couldn't be used as an incriminating fact against the communication partners,
# even under non-democratic regimes that otherwise have severe regulations
# empowering their agencies to arbitrarily demand the handing over of the keys
# of encrypted materials or even simply outlaw encrypted communications in
# general. (Maybe one day they would have to outlaw emails as such in view of
# the nice secret hiding capability provided by EMAILSTEGANO??) Note that the
# difficulty facing the wardens is that there is barely any practical means to
# discern more or less reliably whether a given piece of email is the result
# of processing by EMAILSTEGANO or not.

# Avoidance of long words tends to improve the appearance of the stegotext. In
# extreme cases one could rewrite the input text a little in order to obtain
# a better-looking stegotext.

# Note that the method we use can of course also be applied to arbitrary kinds
# of hand-written texts (in particular normal letters and copies of certain
# well-known literary works), where the more or less unsatisfactory raggedness
# of the line ends mentioned above may be completely avoided if corresponding
# care is taken in writing. (EMAILSTEGANO provides the lines to be copied in
# such cases.)

# No attempt is made in the code to optimize for efficiency etc.

# It may be noted that the rather low stego-bit rate of the present scheme
# wouldn't be a problem in practice in cases where the realm of discourse is
# highly limited, e.g. when a few numerical codes from a codebook used by the
# communication partners suffice to form the entire stego-bit sequences to be
# transmitted.

# For higher efficiency (though demanding some more work of the user) we
# recommend a recent linguistic steganographical scheme of the author:
# WORDLISTTEXTSTEGANOGRAPHY.


# Version 1.0, released 29.07.2012.


# This software may be freely used:

# 1. for all personal purposes unconditionally and

# 2. for all other purposes under the condition that its name, version number 
#    and authorship are explicitly mentioned and that the author is informed of
#    all eventual code modifications done.


# A list of the present author's software that is currently maintained directly
# by himself is available at http://mok-kong-shen.de. Users are advised to
# download such software from that home page only.


# Concrete comments and constructive critiques are sincerely solicited either
# via the above thread or directly via email.


# Email address of the author: mok-kong.shen@t-online.de



# We presume that the computer, on which this software is run, is free from
# malware infection via software and/or hardware means and that there are no
# emission security risks.

# The following comments have no direct relevance to EMAILSTEGANO but are
# included here as information for those who may be interested in further
# developments of text-based steganography.

# Elsewhere someone mentioned a known stego scheme of using the first characters
# of words/sentences as stego characters and rightly remarked that the scheme
# can be practically applied ("user-friendly") only when the stego character
# sequence is in natural language (i.e. not encrypted, in which case the scheme
# is however evidently very weak) and not when the stego character sequence is
# a ciphertext of the actual secret message to be transmitted. It seems to me
# that, with an appropriate adaptation/modification, the same classically known
# idea could nonetheless be usefully exploited, if one could accept certain
# corresponding reduction in transmission efficiency. To illustrate with a
# concrete construction: Let the 26 characters of the alphabet be suitably
# divided into 8 groups (in general of different sizes) such that in each group
# there is at least one character that fairly frequently is the first character
# of sentences in natural language communications. Then, given any arbitrary set
# of 3 stego bits, the user wouldn't have too much difficulty in writing a
# sentence which is sufficiently natural in the given context of communication
# and which starts with a character from the one of the said 8 groups that
# is encoded by these 3 stego bits. This way, each sentence of the covertext
# can transmit 3 stego bits which, though not a very high rate, is nevertheless
# worthy of consideration in practice IMHO. (Of course, such a scheme could be
# used simultaneously with EMAILSTEGANO to obtain a higher total stego bit
# rate.) The idea is apparently fairly flexible such
# that eventually it could also be applied to clauses and further to subjects,
# verbs or objects of sentences. Further, one could use e.g. the 2nd character
# of the words concerned instead of the 1st character. It is clear that, for
# best performance, careful studies of extensive data of the language concerned
# would be required. Helpful for such investigations would be not only
# dictionaries of synonyms and word frequency lists but also special works like
# Le Robert Dictionnaire des mots croises et mots fleches. (Note that, since one
# exploits only one character of the words and not the entire words, one
# naturally has flexibility/simplicity that is hardly attainable with schemes
# that depend on word substitutions.)
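
# A sketch of the recipient's side of the first-character idea above. The
# division of the alphabet into 8 groups shown here is purely hypothetical
# (it is not based on any frequency study) and the names are assumptions.

```python
# Hypothetical grouping: group k encodes the 3 bits of the number k. Each
# group begins with a letter that often starts English sentences.
GROUPS = ["TQX", "AJZ", "IKV", "SUY", "WDG", "HFL", "OMN", "BCEPR"]
LETTER_TO_BITS = {c: format(k, "03b") for k, grp in enumerate(GROUPS) for c in grp}

def sentence_bits(sentences):
    """Return 3 stego bits per sentence, taken from its first character.

    Assumes every sentence starts with a letter of the 26-letter alphabet.
    """
    return "".join(LETTER_TO_BITS[s.lstrip()[0].upper()] for s in sentences)

cover = ["We met on Monday.", "It rained all day.", "Bring the map."]
bits_from_cover = sentence_bits(cover)
```

# With this grouping W, I and B lie in groups 4, 2 and 7, so bits_from_cover
# is "100010111". The sender's task is the reverse: to find a natural sentence
# whose first letter lies in the group given by the next 3 stego bits.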

# Rhinedahl suggested in 2010 to exploit the grammatical information of
# sentences to transmit 5 stego bits per sentence as follows:

# 1st bit = number of noun phrases in the sentence modulo 2.
# 2nd bit = number of adjectives modulo 2.
# 3rd bit = number of adverbs modulo 2.
# 4th bit = number of clauses modulo 2.
# 5th bit = was the main verb transitive (=1) or intransitive (=0)?

# Pending the accumulation of practical experience, it seems open whether or
# not certain modifications of this scheme would be required for best
# performance.
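
# The five rules listed above translate directly into a small function, given
# the grammatical counts of a sentence. How these counts are obtained (by hand
# or with a parser) is left open here; the function name and the hand-made
# annotation of the sample sentence are assumptions.

```python
def rhinedahl_bits(noun_phrases, adjectives, adverbs, clauses, transitive):
    """Return the 5 stego bits of one sentence as a string of 0/1."""
    bits = (
        noun_phrases % 2,       # 1st bit
        adjectives % 2,         # 2nd bit
        adverbs % 2,            # 3rd bit
        clauses % 2,            # 4th bit
        int(bool(transitive)),  # 5th bit: transitive main verb = 1
    )
    return "".join(str(b) for b in bits)

# "The old dog quickly chased a cat.", annotated by hand as: 2 noun phrases,
# 1 adjective, 1 adverb, 1 clause, transitive main verb.
sample_sentence_bits = rhinedahl_bits(2, 1, 1, 1, True)
```

# For the sample annotation sample_sentence_bits is "01111".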

# A well-known stego idea employs word substitutions. Example: Let there be sets
# of words e.g. {John, Bill, ...}, {drive, fly, ...}, {Monday, Tuesday, ...}, 
# {London, Paris, ...}, {visit, see, ...}, {Jane, Mary, ...}, i.e. sets of words
# having the same conceptual categories. Then one could use sentences like "Bill
# flies on Tuesday to Paris to see Mary" to convey a few stego bits. There are
# non-trivial problems to implement a good general-purpose stego scheme based on
# this. However, for a sufficiently narrow particular universe of discourse,
# which often is the case in a crypto context, the difficulties could presumably
# be more easily overcome.
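
# A toy sketch of the word-substitution idea, using two words from each of the
# sets named above so that every slot of a fixed sentence pattern carries one
# bit. The pattern, the inflected verb forms and all names are assumptions for
# illustration; a general-purpose scheme would need far more than this.

```python
SLOTS = [
    ("John", "Bill"),
    ("drives", "flies"),
    ("Monday", "Tuesday"),
    ("London", "Paris"),
    ("visit", "see"),
    ("Jane", "Mary"),
]
TEMPLATE = "{} {} on {} to {} to {} {}"

def encode_sentence(bits):
    """Build a sentence carrying 6 stego bits, one per word slot."""
    if len(bits) != len(SLOTS):
        raise ValueError("exactly 6 bits are needed")
    return TEMPLATE.format(*(SLOTS[k][int(b)] for k, b in enumerate(bits)))

def decode_sentence(sentence):
    """Recover the bits from the slot words found in the sentence."""
    tokens = sentence.split()
    return "".join(str(pair.index(w)) for pair in SLOTS for w in tokens if w in pair)

stego_sentence = encode_sentence("111111")
recovered = decode_sentence(stego_sentence)
```

# Here stego_sentence is "Bill flies on Tuesday to Paris to see Mary" and
# recovered is "111111".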

# Obviously the scheme could also be employed manually, including for
# hand written letters.



# Beginning of the code proper of EMAILSTEGANO.


import random

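# An auxiliary function of getparagraphs(). It sets the global h to the index
# of the last eol of the run of blank lines (lines that are empty or consist of
# spaces only) that follows the first block of non-blank lines in the global
# string g. If g begins with blank lines, h is the index of the last eol of
# that leading run, so that the paragraph cut off by getparagraphs() is empty.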
def findh():
  global g,h
  h=-1
  glen1=len(g)-1
  suc=0
  break1=0
  while h < glen1:
    h1=h+1
    if "\n" in g[h1:]:
      d=g[h1:].index("\n")
      hn=h1+d
      break2=0
      for i in range(h1,hn):
        if g[i]!=" ":
          if suc==0:
            h=hn
          else:
            break1=1
          break2=1
          break
      if break1==1: break
      if break2==1: continue
      suc=1
      h=hn

def getparagraphs(filename):
  global g,h,paragraphs
  with open(filename) as f:
    g=f.read()
  g+="\n\n"
  paragraphs=[]
  while len(g)>0:
    findh()
    paragraphs+=[g[0:h+1].split()]
    g=g[h+1:]
  if paragraphs[0]==[]:
    paragraphs=paragraphs[1:]

def embedbitsintoparagraph(paragraphnumber):
  global paragraphs,stegobits,lenstegobits,bk,dummybits,maxlinelen
  text=paragraphs[paragraphnumber]
  textlen=len(text) 
  lineblock=[]   
  line=[]
  linelen=0
  wordn=0
  maxsize=maxlinelen//2-1
  for i in range(textlen):
    if len(text[i])>maxsize:
      print("word too long: ",text[i])
      raise SystemExit(1)
    linelennew=linelen+1+len(text[i])     
    if linelennew<=maxlinelen:
      if wordn==0:                
        line+=[text[i]]
        linelen=linelennew-1
      else:
        line=line+[" "]+[text[i]]
        linelen=linelennew
      wordn+=1
    else:
      if bk < lenstegobits:
        sbit=int(stegobits[bk])
      else:
        sbit=random.randint(0,1)
        if sbit==0:
          dummybits+="0"
        else:
          dummybits+="1"
      if sbit!=wordn%2:
        lastword=line[-1]
        line=line[:-2]
        lineblock+=[line]
        line=[lastword]+[" "]+[text[i]]
        linelen=len(lastword)+1+len(text[i])
        wordn=2
      else:
        lineblock+=[line]
        line=[text[i]]
        linelen=len(text[i])
        wordn=1
      bk+=1
  if linelen>0:    
    lineblock+=[line]
  else:
# Convention: last line of a paragraph does not contain stego bit.
    bk-=1
  textout=""
  for lineitems in lineblock:
    textout+="".join(lineitems)+"\n"
  textout+="\n"
  return textout

def embedstegobits(rawfile,stegofile):
  global paragraphs,stegobits,lenstegobits,bk,dummybits,maxlinelen
  lenstegobits=len(stegobits)
  dummybits=""
  getparagraphs(rawfile)
# Index of the bit in stegobits that is yet to be embedded
  bk=0
  sumtext=""
  for i in range(len(paragraphs)):
    g=embedbitsintoparagraph(i)
    sumtext+=g
  if bk < lenstegobits:
    print("Rawfile is too small for embedding all stego bits, only",bk,
          "stego bits could be embedded")
    raise SystemExit(2)
  elif len(dummybits)==0:
    print("All",len(stegobits),"stegobits are embedded")
    print("Stegobits: ",stegobits)
  else:
    print("All",len(stegobits),"stegobits are embedded, there are additionally",
          len(dummybits),"random dummy bits")
    print("Stegobits: ",stegobits)
    print("Dummybits: ",dummybits)
  with open(stegofile,"w") as f:
    f.write(sumtext)

def recoverstegobits(stegofile):
  recoveredbits=""
  with open(stegofile) as f:
    g=f.read()
# Note that paragraphs here is defined differently than the global variable
# paragraphs used in getparagraphs() etc.
  paragraphs=g.split("\n\n")
  for i in range(len(paragraphs)):
    paragraph=paragraphs[i]
    lines=paragraph.split("\n")
# Convention: last line of a paragraph does not contain embedded bit.
    for j in range(len(lines)-1):
      line=lines[j]
      words=line.split()
      rb=len(words)%2
      if rb==0:
        recoveredbits+="0"
      else:
        recoveredbits+="1"
  print("Bits recovered from stegofile: ",recoveredbits)
  return recoveredbits
  
  
# An example of use:

# Maximal line length in the output stegofile.
maxlinelen=72

# Stego bits to be transmitted.
stegobits="0101110111"

# Name of the user-given input text file.
rawfile="rawfile.txt"

# Name of the output file that carries the user-given stegobits.
stegofile="stegofile.txt"

# Sender generates the stegofile.
embedstegobits(rawfile,stegofile)

# The recipient obtains the stegobits or the sender checks the correctness
# of processing.
recoverstegobits(stegofile)



# With the first two paragraphs of "A Tale of Two Cities" as rawfile, one run
# of the example led to the following content (without "# ") in stegofile (with
# two added dummy bits):

# It was the best of times, it was the worst of times, it was the age
# of wisdom, it was the age of foolishness, it was the epoch of belief, it
# was the epoch of incredulity, it was the season of Light, it was the
# season of Darkness, it was the spring of hope, it was the winter
# of despair, we had everything before us, we had nothing before us, we
# were all going direct to Heaven, we were all going direct the other
# way -- in short, the period was so far like the present period, that
# some of its noisiest authorities insisted on its being received, for
# good or for evil, in the superlative degree of comparison only.

# There were a king with a large jaw and a queen with a plain face, on the
# throne of England; there were a king with a large jaw and a queen with
# a fair face, on the throne of France. In both countries it was clearer
# than crystal to the lords of the State preserves of loaves and
# fishes, that things in general were settled for ever.


