Add Python 3 'surrogateescape' documentation
parent 791f070e66
commit 5fc851a1e0
1 changed file with 87 additions and 0 deletions
@ -116,6 +116,7 @@
<li><a href="#Python_nn74">Function annotation</a>
<li><a href="#Python_nn75">Buffer interface</a>
<li><a href="#Python_nn76">Abstract base classes</a>
<li><a href="#Python_nn77">Byte string output conversion</a>
</ul>
</ul>
</div>

@ -5928,6 +5929,92 @@ For details of abstract base class, please see
<a href="http://www.python.org/dev/peps/pep-3119/">PEP 3119</a>.
</p>

<H3><a name="Python_nn77"></a>35.12.4 Byte string output conversion</H3>

<p>
By default, any byte string (<tt>char*</tt> or <tt>std::string</tt>) returned
from C or C++ code is decoded to text as UTF-8. Under Python 3.1 or higher,
this decoding uses the <tt>surrogateescape</tt> error handler, which maps each
undecodable byte to a lone low surrogate code point in the range U+DC80 to
U+DCFF.

As an example, consider the following SWIG interface, which exposes a byte
string that cannot be completely decoded as UTF-8:
</p>

<div class="code"><pre>
%module example

%include <std_string.i>

%inline %{

const char* non_utf8_c_str(void) {
        return "h\xe9llo w\xc3\xb6rld";
}

%}
</pre></div>

<p>
When this function is called from Python 3, the return value is the following
text string:
</p>

<div class="code"><pre>
>>> s = example.non_utf8_c_str()
>>> s
'h\udce9llo wörld'
</pre></div>

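<p>
The decoding step can be reproduced in plain Python, without the SWIG module.
The following is a minimal sketch: the variable names are illustrative, and the
byte string is assumed to match what <tt>non_utf8_c_str()</tt> returns.
</p>

```python
# Raw bytes assumed to be equivalent to the C string in the interface above.
raw = b"h\xe9llo w\xc3\xb6rld"

# Strict UTF-8 decoding fails: 0xE9 starts a multi-byte sequence but is
# followed by the ASCII byte 'l' instead of a continuation byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the undecodable byte 0xE9 to U+DCE9, while the valid
# sequence 0xC3 0xB6 is decoded to U+00F6 as usual.
s = raw.decode("utf-8", errors="surrogateescape")

print(strict_ok)  # False
print(ascii(s))   # 'h\udce9llo w\xf6rld'
```

<p>
<tt>ascii()</tt> is used for printing because writing the string itself to a
UTF-8 terminal would be rejected by the <tt>strict</tt> error handler.
</p>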
<p>
Since the C string contains bytes that cannot be decoded as UTF-8, those raw
bytes are represented as low surrogate code points, which can be encoded back
to the original byte sequence:
</p>

<div class="code"><pre>
>>> b = s.encode('utf-8', errors='surrogateescape')
>>> b
b'h\xe9llo w\xc3\xb6rld'
</pre></div>

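<p>
Code that receives strings from a wrapped library may want to detect escaped
bytes before deciding how to handle them. A minimal sketch follows; the helper
names <tt>has_escaped_bytes</tt> and <tt>original_bytes</tt> are hypothetical
and are not part of SWIG or the Python standard library.
</p>

```python
def has_escaped_bytes(text):
    """Return True if text contains a code point in U+DC80..U+DCFF,
    the range used by surrogateescape for undecodable bytes."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


def original_bytes(text):
    """Re-encode text to the byte sequence it was decoded from."""
    return text.encode("utf-8", errors="surrogateescape")


# Value assumed to match the string returned by the wrapped function above.
s = "h\udce9llo w\u00f6rld"

print(has_escaped_bytes(s))              # True
print(has_escaped_bytes("plain ASCII"))  # False
print(original_bytes(s))                 # b'h\xe9llo w\xc3\xb6rld'
```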
<p>
One can then attempt a different encoding, if desired (or simply leave the
byte string as a raw sequence of bytes for use in binary protocols). Note that
Latin-1 maps every byte to exactly one character, so the two bytes that form a
valid UTF-8 <tt>ö</tt> are decoded as two separate characters:
</p>

<div class="code"><pre>
>>> b.decode('latin-1')
'héllo wÃ¶rld'
</pre></div>

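<p>
When the source encoding is not known in advance, one possible approach is to
try a list of candidate encodings in order. The sketch below assumes a
hypothetical helper named <tt>redecode</tt>; the default candidate list is an
illustration only, not a recommendation.
</p>

```python
def redecode(text, encodings=("utf-8", "latin-1")):
    """Recover the original bytes from text and decode them strictly with
    the first candidate encoding that accepts them.

    Returns (decoded_text, encoding_name), or (raw_bytes, None) if every
    candidate fails.
    """
    raw = text.encode("utf-8", errors="surrogateescape")
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return raw, None


# Value assumed to match the string returned by the wrapped function above.
s = "h\udce9llo w\u00f6rld"

decoded, used = redecode(s)
print(used)            # latin-1
print(ascii(decoded))  # 'h\xe9llo w\xc3\xb6rld'
```

<p>
Because Latin-1 accepts every byte value, it only makes sense as the last
candidate in the list.
</p>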
<p>
Note, however, that text strings containing surrogate characters are rejected
by the default <tt>strict</tt> codec error handler. For example:
</p>

<div class="code"><pre>
>>> with open('test', 'w') as f:
...     print(s, file=f)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 1: surrogates not allowed
</pre></div>

<p>
As a result, users need to check most strings returned by SWIG bindings for
surrogate characters before encoding them. The alternative, however, would
leave any non-UTF-8 byte string completely inaccessible to Python 3 code.
</p>

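<p>
Two ways of handling such a string are sketched below: writing it to a file
with the <tt>surrogateescape</tt> handler so that the original bytes are
restored, and producing a printable form with <tt>backslashreplace</tt>. The
temporary file path and variable names are illustrative assumptions.
</p>

```python
import os
import tempfile

# Value assumed to match the string returned by the wrapped function above.
s = "h\udce9llo w\u00f6rld"

# Printable form for logs or terminals: the lone surrogate is replaced by a
# literal backslash escape, other characters are left alone.
printable = s.encode("utf-8", errors="backslashreplace").decode("utf-8")

# Writing with errors="surrogateescape" turns U+DCE9 back into the byte 0xE9.
# newline="\n" disables platform newline translation so the bytes are stable.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test")
    with open(path, "w", encoding="utf-8", errors="surrogateescape",
              newline="\n") as f:
        print(s, file=f)
    with open(path, "rb") as f:
        data = f.read()

print(printable)  # h\udce9llo wörld
print(data)       # b'h\xe9llo w\xc3\xb6rld\n'
```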
<p>
For more details about the <tt>surrogateescape</tt> error handler, please see
<a href="http://www.python.org/dev/peps/pep-0383/">PEP 383</a>.
</p>

</body>
</html>