:mod:`email` Package Architecture
=================================

Overview
--------

The email package consists of three major components:

    Model
        An object structure that represents an email message, and provides an
        API for creating, querying, and modifying a message.

    Parser
        Takes a sequence of characters or bytes and produces a model of the
        email message represented by those characters or bytes.

    Generator
        Takes a model and turns it into a sequence of characters or bytes. The
        sequence can either be intended for human consumption (a printable
        unicode string) or bytes suitable for transmission over the wire. In
        the latter case all data is properly encoded using the content transfer
        encodings specified by the relevant RFCs.

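The three components can be seen working together in a short round trip. The
following is a minimal sketch using the standard library's
``email.message_from_string`` and ``email.generator.Generator``; the sample
message text and the ``X-Reviewed`` header are invented for illustration.

```python
import io
from email import message_from_string, policy
from email.generator import Generator

# Illustrative input; the addresses and header values are made up.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text.\n"
)

# Parser: characters in, model out.
msg = message_from_string(raw, policy=policy.default)

# Model: query and modify through the message API.
subject = msg["Subject"]
msg["X-Reviewed"] = "yes"

# Generator: model in, characters out.
buffer = io.StringIO()
Generator(buffer).flatten(msg)
text = buffer.getvalue()
```

The convenience methods ``msg.as_string()`` and ``msg.as_bytes()`` wrap the
generator step for the common case.
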
Conceptually the package is organized around the model. The model provides both
"external" APIs intended for use by application programs using the library,
and "internal" APIs intended for use by the Parser and Generator components.
This division is intentionally a bit fuzzy; the entire API described by this
documentation is public and stable. This allows for an application
with special needs to implement its own parser and/or generator.

In addition to the three major functional components, there is a fourth key
component to the architecture:

    Policy
        An object that specifies various behavioral settings and carries
        implementations of various behavior-controlling methods.

The Policy framework provides a simple and convenient way to control the
behavior of the library, making it possible for the library to be used in a
very flexible fashion while leveraging the common code required to parse,
represent, and generate message-like objects. For example, in addition to the
default :rfc:`5322` email message policy, we also have a policy that manages
HTTP headers in a fashion compliant with :rfc:`2616`. Individual policy
controls, such as the maximum line length produced by the generator, can also
be set individually to meet specialized application requirements.

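As a rough illustration of adjusting a single control, the sketch below clones
the default policy with a shorter ``max_line_length`` and serializes the same
message under both policies. The subject text and the limit of 40 are arbitrary
choices made for the example.

```python
from email import policy
from email.message import EmailMessage

msg = EmailMessage()  # EmailMessage uses policy.default unless told otherwise
msg["Subject"] = " ".join(["word"] * 30)  # long enough to require folding

# clone() returns a new policy with only the named controls changed.
narrow = policy.default.clone(max_line_length=40)

default_text = msg.as_string()              # folded at the default limit
narrow_text = msg.as_string(policy=narrow)  # same model, tighter limit

# The HTTP policy is the default policy with two controls changed.
http_linesep = policy.HTTP.linesep
http_limit = policy.HTTP.max_line_length
```

Because policies are immutable, ``clone`` is the way to derive a variant; the
original ``policy.default`` is left untouched.
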
The Model
---------

The message model is implemented by the :class:`~email.message.Message` class.
The model divides a message into the two fundamental parts discussed by the
RFC: the header section and the body. The `Message` object acts as a
pseudo-dictionary of named headers. Its dictionary interface provides
convenient access to individual headers by name. However, all headers are kept
internally in an ordered list, so that the information about the order of the
headers in the original message is preserved.

The `Message` object also has a `payload` that holds the body. A `payload` can
be one of two things: data, or a list of `Message` objects. The latter is used
to represent a multipart MIME message. Lists can be nested arbitrarily deeply
in order to represent the message, with all terminal leaves having non-list
data payloads.

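A small sketch of both aspects of the model, the ordered pseudo-dictionary of
headers and the nested payload, might look like this. The header values are
invented; ``Message()`` without arguments uses the ``compat32`` policy.

```python
from email.message import Message

outer = Message()
outer["Received"] = "first hop"
outer["Subject"] = "demo"
outer["Received"] = "second hop"  # __setitem__ appends, it does not replace
outer["Content-Type"] = "multipart/mixed"

part = Message()
part["Content-Type"] = "text/plain"
part.set_payload("hello")  # data payload: a terminal leaf
outer.attach(part)         # outer's payload becomes a list of Message objects

header_order = outer.keys()            # insertion order, duplicates included
received = outer.get_all("Received")   # every value stored under one name
leaves = [p.get_payload() for p in outer.walk() if not p.is_multipart()]
```

``walk`` visits the message and all of its sub-parts depth first, which makes it
a convenient way to reach the leaves of an arbitrarily nested structure.
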
Message Lifecycle
-----------------

The general lifecycle of a message is:

    Creation
        A `Message` object can be created by a Parser, or it can be
        instantiated as an empty message by an application.

    Manipulation
        The application may examine one or more headers, and/or the
        payload, and it may modify one or more headers and/or
        the payload. This may be done on the top level `Message`
        object, or on any sub-object.

    Finalization
        The Model is converted into a unicode or binary stream,
        or the model is discarded.

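The lifecycle stages above can be sketched with :class:`~email.message.EmailMessage`.
The addresses, attachment bytes, and the ``X-Note`` header are placeholders
chosen for the example.

```python
from email.message import EmailMessage

# Creation: an empty message instantiated by the application.
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Report"
msg.set_content("See the attachment.\n")
msg.add_attachment(
    b"\x00\x01\x02",
    maintype="application",
    subtype="octet-stream",
    filename="data.bin",
)

# Manipulation: on the top-level object and on a sub-object.
msg.replace_header("Subject", "Report (final)")
attachment = next(msg.iter_attachments())
attachment["X-Note"] = "checked"

# Finalization: serialize the model to text or to bytes.
text = msg.as_string()
wire = msg.as_bytes()
```

``replace_header`` is used here because the default policy allows only one
``Subject`` header, so assigning it a second time would raise ``ValueError``.
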
Header Policy Control During Lifecycle
--------------------------------------

One of the major controls exerted by the Policy is the management of headers
during the `Message` lifecycle. Most applications don't need to be aware of
this.

A header enters the model in one of two ways: via a Parser, or by being set to
a specific value by an application program after the Model already exists.
Similarly, a header exits the model in one of two ways: by being serialized by
a Generator, or by being retrieved from a Model by an application program. The
Policy object provides hooks for all four of these pathways.

The model storage for headers is a list of (name, value) tuples.

The Parser identifies headers during parsing, and passes them to the
:meth:`~email.policy.Policy.header_source_parse` method of the Policy. The
result of that method is the (name, value) tuple to be stored in the model.

When an application program supplies a header value (for example, through the
`Message` object `__setitem__` interface), the name and the value are passed to
the :meth:`~email.policy.Policy.header_store_parse` method of the Policy, which
returns the (name, value) tuple to be stored in the model.

When an application program retrieves a header (through any of the dict or list
interfaces of `Message`), the name and value are passed to the
:meth:`~email.policy.Policy.header_fetch_parse` method of the Policy to
obtain the value returned to the application.

When a Generator requests a header during serialization, the name and value are
passed to the :meth:`~email.policy.Policy.fold` method of the Policy, which
returns a string containing line breaks in the appropriate places. The
:attr:`~email.policy.Policy.cte_type` Policy control determines whether or
not Content Transfer Encoding is performed on the data in the header. There is
also a :meth:`~email.policy.Policy.binary_fold` method for use by generators
that produce binary output, which returns the folded header as binary data,
possibly folded at different places than the corresponding string would be.

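One way to observe the four pathways is a policy subclass that records each
hook before deferring to its parent. ``TracingPolicy`` and the ``calls`` list
below are hypothetical names introduced for this sketch. Note that in the
released :mod:`email.policy` module the binary folding hook is spelled
``fold_binary``, which is the name the code uses.

```python
from email import message_from_string
from email.policy import EmailPolicy

calls = []  # hook names, in the order they run


class TracingPolicy(EmailPolicy):
    """Hypothetical policy: log each header hook, then defer to EmailPolicy."""

    def header_source_parse(self, sourcelines):
        calls.append("header_source_parse")
        return super().header_source_parse(sourcelines)

    def header_store_parse(self, name, value):
        calls.append("header_store_parse")
        return super().header_store_parse(name, value)

    def header_fetch_parse(self, name, value):
        calls.append("header_fetch_parse")
        return super().header_fetch_parse(name, value)

    def fold(self, name, value):
        calls.append("fold")
        return super().fold(name, value)

    def fold_binary(self, name, value):
        calls.append("fold_binary")
        return super().fold_binary(name, value)


# Parser path: one header enters the model from source text.
msg = message_from_string("Subject: hi\n\nbody\n", policy=TracingPolicy())
msg["To"] = "bob@example.com"  # application sets a header
subject = msg["Subject"]        # application retrieves a header
text = msg.as_string()          # text generator folds each header
data = msg.as_bytes()           # binary generator folds each header
```

Inspecting ``calls`` afterwards shows which hook served each operation; the
folding hooks run once per header per serialization.
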
Handling Binary Data
--------------------

In an ideal world all message data would conform to the RFCs, meaning that the
parser could decode the message into the idealized unicode message that the
sender originally wrote. In the real world, the email package must also be
able to deal with badly formatted messages, including messages containing
non-ASCII characters that either have no indicated character set or are not
valid characters in the indicated character set.

Since email messages are *primarily* text data, and operations on message data
are primarily text operations (except for binary payloads of course), the model
stores all text data as unicode strings. Un-decodable binary inside text
data is handled by using the `surrogateescape` error handler of the ASCII
codec. As with the binary filenames that the error handler was originally
introduced to handle, this allows the email package to "carry" the binary data
received during parsing along until the output stage, at which time it is
regenerated in its original form.

This carried binary data is almost entirely an implementation detail. The one
place where it is visible in the API is in the "internal" API. A Parser must
do the `surrogateescape` encoding of binary input data, and pass that data to
the appropriate Policy method. The "internal" interface used by the Generator
to access header values preserves the `surrogateescaped` bytes. All other
interfaces convert the binary data either back into bytes or into a safe form
(losing information in some cases).

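The carrying mechanism can be demonstrated first at the codec level and then
through the package. This is a sketch under the ``compat32`` policy; the input
bytes, a header containing a lone Latin-1 ``0xE9`` with no declared character
set, are an invented example of badly formatted data.

```python
from email import message_from_bytes
from email.policy import compat32

# Illustrative input: a non-ASCII byte in a header with no declared charset.
raw = b"Subject: caf\xe9\n\nbody\n"

# Codec level: the undecodable byte becomes a lone surrogate and is
# restored exactly when encoding with the same error handler.
as_text = raw.decode("ascii", "surrogateescape")
round_trip = as_text.encode("ascii", "surrogateescape")

# Package level: the same mechanism carries the byte through the model.
msg = message_from_bytes(raw, policy=compat32)
stored = dict(msg.raw_items())["Subject"]  # raw model value, surrogate intact
regenerated = msg.as_bytes()                # original bytes come back out
safe_text = msg.as_string()                 # text output uses a safe ASCII form
```

``raw_items`` is the accessor that exposes the stored values without passing
them through the policy, which is why the surrogate is still visible there.
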
Backward Compatibility
----------------------

The :class:`~email.policy.Compat32` Policy provides backward
compatibility with version 5.1 of the email package. It does this via the
following implementation of the four+1 Policy methods described above:

header_source_parse
    Splits the first line on the colon to obtain the name, discards any spaces
    after the colon, and joins the remainder of the line with all of the
    remaining lines, preserving the linesep characters to obtain the value.
    Trailing carriage return and/or linefeed characters are stripped from the
    resulting value string.

header_store_parse
    Returns the name and value exactly as received from the application.

header_fetch_parse
    If the value contains any `surrogateescaped` binary data, returns the value
    as a :class:`~email.header.Header` object, using the character set
    ``unknown-8bit``. Otherwise just returns the value.

fold
    Uses :class:`~email.header.Header`'s folding to fold headers in the
    same way the email 5.1 generator did.

binary_fold
    Same as fold, but encodes to 'ascii'.

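The behaviors listed above can be checked by calling the hooks directly on the
``compat32`` policy instance. This is only a sketch with made-up header values;
in the released module the binary folding method is named ``fold_binary``.

```python
from email.header import Header
from email.policy import compat32

# header_source_parse: split on the first colon, drop the whitespace after
# it, keep embedded line separators, strip the trailing one.
name, value = compat32.header_source_parse(
    ["Subject:   a long\r\n", " continued line\r\n"]
)

# header_store_parse: the application's name and value pass straight through.
stored = compat32.header_store_parse("X-Demo", "some value")

# header_fetch_parse: plain strings come back unchanged; values carrying
# surrogateescaped bytes come back wrapped in a Header object.
plain = compat32.header_fetch_parse("Subject", "hello")
wrapped = compat32.header_fetch_parse("Subject", "caf\udce9")

# fold returns str; the binary variant returns the same header as bytes.
folded = compat32.fold("Subject", "hello")
folded_bytes = compat32.fold_binary("Subject", "hello")
```
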
New Algorithm
-------------

header_source_parse
    Same as legacy behavior.

header_store_parse
    Same as legacy behavior.

header_fetch_parse
    If the value is already a header object, returns it. Otherwise, parses the
    value using the new parser, and returns the resulting object as the value.
    `surrogateescaped` bytes get turned into unicode unknown character code
    points.

fold
    Uses the new header folding algorithm, respecting the policy settings.
    `surrogateescaped` bytes are encoded using the ``unknown-8bit`` charset for
    ``cte_type=7bit`` or ``8bit``. Returns a string.

    At some point there will also be a ``cte_type=unicode``, and for that
    policy fold will serialize the idealized unicode message with RFC-like
    folding, converting any `surrogateescaped` bytes into the unicode
    unknown character glyph.

binary_fold
    Uses the new header folding algorithm, respecting the policy settings.
    `surrogateescaped` bytes are encoded using the ``unknown-8bit`` charset for
    ``cte_type=7bit``, and get turned back into bytes for ``cte_type=8bit``.
    Returns bytes.

    At some point there will also be a ``cte_type=unicode``, and for that
    policy binary_fold will serialize the message according to :rfc:`5335`.

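For comparison with the legacy behavior, the same hooks can be exercised on
``policy.default``, which is assumed here to be the policy implementing the new
algorithm in released versions of the package. The header values are invented,
and the binary folding method is spelled ``fold_binary`` in the released API.

```python
from email import policy

p = policy.default

# header_fetch_parse: a plain string is parsed into a header object...
fetched = p.header_fetch_parse("Subject", "Hello")
# ...and an existing header object is returned as is.
same = p.header_fetch_parse("Subject", fetched)

# surrogateescaped bytes surface as U+FFFD in the fetched value.
lossy = p.header_fetch_parse("Subject", "caf\udce9")

# fold returns str; the binary variant returns bytes.
folded = p.fold("Subject", "Hello")
folded_bytes = p.fold_binary("Subject", "Hello")

# With cte_type="7bit" the binary output must be pure ASCII, so the
# surrogateescaped byte is written in an encoded form.
seven_bit = p.clone(cte_type="7bit").fold_binary("Subject", "caf\udce9")
```

The header object returned by ``header_fetch_parse`` is a ``str`` subclass, so
it compares equal to the plain text while also carrying attributes such as
``name`` and ``defects``.
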