Add Python 3 'surrogateescape' documentation

2014-05-23 15:24:35 -04:00 · 2014-05-23 15:24:35 -04:00 · 5fc851a1e0
commit 5fc851a1e0
parent 791f070e66
1 changed files with 87 additions and 0 deletions
--- a/Doc/Manual/Python.html
+++ b/Doc/Manual/Python.html
@@ -116,6 +116,7 @@
 <li><a href="#Python_nn74">Function annotation</a>
 <li><a href="#Python_nn75">Buffer interface</a>
 <li><a href="#Python_nn76">Abstract base classes</a>
+<li><a href="#Python_nn77">Byte string output conversion</a>
 </ul>
 </ul>
 </div>
@@ -5928,6 +5929,92 @@ For details of abstract base class, please see
 <a href="http://www.python.org/dev/peps/pep-3119/">PEP 3119</a>.
 </p>

+<H3><a name="Python_nn77"></a>35.12.4 Byte string output conversion</H3>
+
+
+<p>
+By default, any byte string (<tt>char*</tt> or <tt>std::string</tt>) returned
+from C or C++ code is decoded to text as UTF-8. Under Python 3.1 or higher,
+this decoding uses the <tt>surrogateescape</tt> error handler, which maps each
+undecodable byte to a lone low surrogate character in the range U+DC80 to
+U+DCFF.
+
+As an example, consider the following SWIG interface, which exposes a byte
+string that cannot be completely decoded as UTF-8:
+</p>
+
+<div class="code"><pre>
+%module example
+
+%include &lt;std_string.i&gt;
+
+%inline %{
+
+const char* non_utf8_c_str(void) {
+        return "h\xe9llo w\xc3\xb6rld";
+}
+
+%}
+</pre></div>
+
+<p>
+When this function is called from Python 3, the return value is the following
+text string:
+</p>
+
+<div class="code"><pre>
+&gt;&gt;&gt; import example
+&gt;&gt;&gt; s = example.non_utf8_c_str()
+&gt;&gt;&gt; s
+'h\udce9llo w&#246;rld'
+</pre></div>
+
+<p>
+Since the C string contains a byte (<tt>0xE9</tt>) that cannot be decoded as
+UTF-8, that raw byte is represented as a low surrogate character
+(<tt>U+DCE9</tt>), from which the original byte sequence can be recovered:
+</p>
+
+<div class="code"><pre>
+&gt;&gt;&gt; b = s.encode('utf-8', errors='surrogateescape')
+&gt;&gt;&gt; b
+b'h\xe9llo w\xc3\xb6rld'
+</pre></div>
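+<p>
+The round trip above can be reproduced without a SWIG module. The following is
+a minimal pure-Python sketch, assuming the wrapper's conversion is equivalent
+to <tt>bytes.decode('utf-8', errors='surrogateescape')</tt>; the variable names
+are illustrative only.
+</p>

```python
# Pure-Python sketch of the conversion described above; no SWIG module needed.
raw = b"h\xe9llo w\xc3\xb6rld"  # same bytes as the C string in the example

# Decode as UTF-8 with the surrogateescape handler: the undecodable byte 0xE9
# becomes the lone surrogate U+DCE9, while 0xC3 0xB6 decodes to U+00F6.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))
print(round_tripped == raw)
```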
+
+<p>
+One can then attempt a different encoding, if desired (or simply leave the
+byte string as a raw sequence of bytes for use in binary protocols):
+</p>
+
+<div class="code"><pre>
+&gt;&gt;&gt; b.decode('latin-1')
+'h&#233;llo w&#195;&#182;rld'
+</pre></div>
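+<p>
+Trying alternative encodings can be wrapped in a small helper. This is a
+sketch only: <tt>redecode</tt> and its default list of candidate encodings are
+hypothetical and not part of SWIG or the Python standard library.
+</p>

```python
def redecode(text, encodings=("utf-8", "latin-1")):
    """Re-decode a surrogate-escaped string with the first encoding that fits.

    Hypothetical helper; returns an (encoding_name, decoded_text) pair.
    """
    # Recover the original bytes; escaped surrogates turn back into raw bytes.
    raw = text.encode("utf-8", errors="surrogateescape")
    for encoding in encodings:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the bytes")


s = "h\udce9llo w\xf6rld"  # the string returned in the example above
encoding, decoded = redecode(s)
print(encoding, ascii(decoded))
```

+<p>
+Note that a single fallback encoding is applied to the whole byte string, so
+bytes that were valid UTF-8 (here <tt>0xC3 0xB6</tt>) are reinterpreted as
+well, as the <tt>latin-1</tt> output above shows.
+</p>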
+
+<p>
+Note, however, that text strings containing surrogate characters are rejected
+by codecs using the default <tt>strict</tt> error handler, for example when
+the string is written to a text file:
+</p>
+
+<div class="code"><pre>
+&gt;&gt;&gt; with open('test', 'w') as f:
+...     print(s, file=f)
+...
+Traceback (most recent call last):
+  File "&lt;stdin&gt;", line 2, in &lt;module&gt;
+UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 1: surrogates not allowed
+</pre></div>
+
+<p>
+This means that strings returned by SWIG bindings may need to be checked for
+surrogate characters before they are encoded or written out, but the
+alternative is for a non-UTF-8 byte string to be completely inaccessible
+in Python 3 code.
+</p>
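+<p>
+One way to perform such a check, and to write the string out unchanged, is
+sketched below. The helper <tt>has_escaped_bytes</tt> is hypothetical; the
+<tt>errors</tt> argument is the standard one accepted by
+<tt>io.TextIOWrapper</tt> (and by <tt>open()</tt>). An in-memory buffer stands
+in for a real file.
+</p>

```python
import io


def has_escaped_bytes(text):
    """Return True if text holds surrogates in the surrogateescape range."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


s = "h\udce9llo w\xf6rld"  # the string returned in the example above

# The default strict handler refuses to encode the lone surrogate.
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A text stream opened with errors="surrogateescape" writes the raw bytes back.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8",
                          errors="surrogateescape", newline="\n")
print(s, file=stream)
stream.flush()
written = buffer.getvalue()

print(has_escaped_bytes(s), strict_ok, written)
```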
+
+<p>
+For more details about the <tt>surrogateescape</tt> error handler, please see
+<a href="http://www.python.org/dev/peps/pep-0383/">PEP 383</a>.
+</p>
+
 </body>
 </html>