What is the function of each string prefix in Python?

Viewed 1,259 times

5

In Python, we often see strings with a prefix, such as the r prefix used below:

r"""OS routines for NT or Posix depending on what system we're on.
This exports:
  - all functions from posix or nt, e.g. unlink, stat, etc.
  - os.path is either posixpath or ntpath
  - os.name is either 'posix' or 'nt'
  - os.curdir is a string representing the current directory (always '.')
  - os.pardir is a string representing the parent directory (always '..')
  - os.sep is the (or a most common) pathname separator ('/' or '\\')
  - os.extsep is the extension separator (always '.')
  - os.altsep is the alternate pathname separator (None or '/')
  - os.pathsep is the component separator used in $PATH etc
  - os.linesep is the line separator in text files ('\r' or '\n' or '\r\n')
  - os.defpath is the default search path for executables
  - os.devnull is the file path of the null device ('/dev/null', etc.)
Programs that import and use 'os' stand a better chance of being
portable between different platforms.  Of course, they must then
only use functions that are defined by all platforms (e.g., unlink
and opendir), and leave all pathname manipulation to os.path
(e.g., split and join).
"""

Excerpt from the source code of the os module, taken from the official repository.

  • What are all the existing prefixes in Python?
  • What is the function of each prefix?
  • What impact does the prefix have on the value, type and size of the string?

Ideally, the answer should cover both version 2 and version 3 of the language and describe the differences, if any, between them.

  • https://answall.com/q/80545/101

2 answers

4


So, here we go. When I say Python, I mean Python 3; let's assume that Python 2 is a thing of the past. (I do talk about strings in Python 2, but only because this is one of the points where the language changed the most.)

  • Use of ' or " to delimit strings: there is no difference. Use whatever you prefer or whatever is most convenient to delimit your strings of any kind.

  • Use of triple quotes - """ or ''': the string only ends at the matching triple quotes and includes everything in between: other quotes, line breaks, etc. It is ideal for embedding small snippets of other languages, such as SQL or HTML, when necessary, or for writing documentation strings (docstrings).

  • Strings with no prefix or with the u" prefix: in Python 3, the "u" does nothing. It was added in Python 3.3 (if memory serves) to make it easier to write code that works on both Python 2 and Python 3 without modification. In Python 2, on the other hand, a string prefixed with u" is a Unicode string, that is, Python treats it as text, not as a sequence of bytes. In a Unicode string, Python knows that each element is a character, whereas in a byte string you depend on the text encoding to read each element and, in some encodings, a character uses more than one byte. So, in Python 2, with a string without the u, if you try to convert a word like "maçã" to uppercase on a Unix system (Android, Linux, Mac OS), which uses UTF-8 by default, you end up with "MAçã" instead of "MAÇÃ" (and what's more, len("maçã") returns 6, not 4). With Unicode strings, the "ç" and the "ã" are understood as letters. It is very important to understand what Unicode text is; it is vital for programming, actually. I recommend reading this article, written more than 10 years ago by one of the founders of Stack Overflow. To restate it clearly: in Python 3, every string is Unicode by default unless it has the "b" prefix. In Python 2, every string is just a sequence of bytes unless it has the "u" prefix.

A correct program in Python 2 should have all of its strings prefixed with "u". In Python 3, you will occasionally have a byte string, which is an object of type bytes: generally when using a low-level library such as sockets, or when opening a file for reading in binary mode. In this case, what you should always do before trying to work with the text is decode it to a string: a byte string is always encoded with some encoding, be it "latin1", "utf-8", "cp852" for the DOS terminal, "cp1251" for Russian, "cp1253" for Greek, etc. So you call the decode method of the bytes object itself, turning it into a proper text string. In Python 2 it is the same thing: the same decode method gives you a "Unicode" string, which can be used as text.

In fact, a lot of care is needed to program correctly in Python 2, because most people don't know about these Unicode and non-Unicode steps. This leads to programs that use text incorrectly. In Python 3, most text operations are either not permitted or work differently with byte strings, which encourages the correct use of strings (which are Unicode).

For example: in Python 2 it is possible to concatenate a byte string and a text string with +. But if the byte string contains even a single character outside the ASCII range (0 to 127), for example an accented character, Python raises an error. This causes intermittent errors in your program if you concatenate two variables and one of them does not contain Unicode, and it has already cost people a lot of sleep:

>>> u"maç" + "ã"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

(But u"maç" + "a" works.) In Python 3 this is always an illegal operation: you have to convert everything to text before trying to concatenate. This makes the error show up on every run of the program, so it is easy for the developer to fix. With the way Python 2 does it, the error can go unnoticed and only cause problems in production.
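To make the Python 3 side of this concrete, here is a minimal sketch. It assumes the bytes are UTF-8 (the encoding is my choice for illustration; real data may use another one), and the variable names are made up for the example.

```python
# Python 3 sketch: bytes must be decoded before they can be mixed with text.
raw = "maçã".encode("utf-8")   # bytes, e.g. what a socket or binary file would give you
text = raw.decode("utf-8")     # back to str (Unicode text)

sizes = (len(raw), len(text))  # number of bytes versus number of characters

try:
    "maç" + raw                # str + bytes is rejected, whatever the content
    mixed_ok = True
except TypeError:
    mixed_ok = False

upper = text.upper()           # text operations treat "ç" and "ã" as letters
print(sizes, mixed_ok, upper)
```

The concatenation fails even when the bytes are pure ASCII, which is the point: the mistake shows up on the first run instead of depending on the data.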

  • Prefix b: implies that the object in quotes is not a text string, but an object of type bytes. It is more or less equivalent to the string without any prefix that existed in Python 2. In Python 3, however, a bytes object is quite different from the byte string of Python 2: in particular, when you index it with [ ], Python returns a number between 0 and 255: the value of the byte at that position. (The same happens if you iterate over such an object with a for statement):

    >>> for char in "maçã":
    ...    print(char, end=" ")
    ...
    m a ç ã
    >>> for char in b"maçã":
    ...    print(char, end=" ")
    ...
      File "<stdin>", line 1
    SyntaxError: bytes can only contain ASCII literal characters.
    >>> for char in b"maca":
    ...    print(char, end=" ")
    ...
    109 97 99 97

This is in Python 3. Python 2 would print the 4 letters. As you can see, Python 3 does not even allow you to place characters outside the ASCII range within a bytes literal with the b" prefix. (If you need to do this, you must use the escape \xHH and put the code, in hexadecimal, directly into the string.)
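A short sketch of the \xHH route in Python 3, assuming we want the UTF-8 bytes of "maçã" (the byte values below are simply that word's UTF-8 encoding):

```python
# UTF-8 bytes of "maçã", written with \xHH escapes inside a bytes literal.
data = b"ma\xc3\xa7\xc3\xa3"

first = data[0]                 # indexing bytes gives an int
values = list(data)             # iterating gives ints as well
one_byte = data[0:1]            # slicing, on the other hand, still gives bytes
decoded = data.decode("utf-8")  # decoding turns the bytes into text

print(first, values, one_byte, decoded)
```

If you want a one-byte bytes object instead of a number, slicing (data[0:1]) is the usual way to get it.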

  • Prefix r": indicates a "raw" literal: inside that string, the backslash does not have its special role of acting as an "escape" for special characters. In a normal Python string, some sequences such as "\n", "\t" and "\b" do not represent two characters, but a single control character (respectively code 10 "new line", code 9 "tab", and code 8 "backspace"). With the r prefix, the backslash is always a backslash and the next character is always itself. Therefore, when you write Windows file names that use the "\" character to separate directories, you need the r" prefix. Otherwise, your file names work as long as the letters after each "\" do not form an escape sequence, but go wrong when the "\" precedes a character that does. (These sequences starting with "\" are inherited from the C language, incidentally.)

Windows Tip: you can, and should, use the "/" character to separate directories - it works inside Python.

Another frequent use for strings with r" is when defining regular expressions. These use sequences starting with \ for their own purposes, so if Python transforms the sequence "\n" into a single character with code 10, the regular expression code never even sees the \.
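A small sketch of both uses of r" described above. The path and the sample text are invented for the example; only the standard re module is used.

```python
import re

normal = "C:\new\table.txt"  # "\n" and "\t" are turned into control characters
raw = r"C:\new\table.txt"    # every backslash stays a backslash

lengths = (len(normal), len(raw))

# In a regular expression, \b means "word boundary", but only if the
# backslash actually reaches the re module.
with_raw = re.search(r"\bcat\b", "a cat sat")    # pattern: \b c a t \b
without_raw = re.search("\bcat\b", "a cat sat")  # pattern: backspace c a t backspace

print(lengths, with_raw is not None, without_raw is not None)
```

The non-raw pattern silently looks for backspace characters, so it finds nothing and raises no error, which is why this kind of bug is easy to miss.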

We can open a parenthesis here and mention three very important escapes with \: \xHH (replace each "H" with a hexadecimal digit) lets you place any code from 0 to 255 in a string. In text objects, characters in this range coincide with the latin1 encoding, that is, any number placed after the \x generates a valid character. Another is \uHHHH: 4 hexadecimal digits to specify any Unicode character with a code below 65536. A web search, or better yet, printing the characters in the Python console, lets you find the letters of several different alphabets, dozens of emojis and hundreds of other symbols (including, for example, chess pieces). The escape \UHHHHHHHH takes 8 hexadecimal digits and allows the use of characters with codes of 65536 and above.
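The three escapes side by side, as a Python 3 sketch. The specific code points are my own picks for illustration; unicodedata is only used to show the official character names.

```python
import unicodedata

c_cedilla = "\xe7"       # 2 hex digits: code 0xE7, same position as in latin1
same_char = "\u00e7"     # the same character written with 4 hex digits
white_king = "\u2654"    # a chess piece, still below 65536
grinning = "\U0001F600"  # above 65535, so it needs the 8-digit form

names = [unicodedata.name(ch) for ch in (c_cedilla, white_king, grinning)]
print(c_cedilla, white_king, grinning, names)
```

In Python 3 each of these is a single character of the string, so len() returns 1 for all of them regardless of how many bytes the character needs when encoded.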

  • Combination of "u" or "b" with "r": the "r" can be combined with another prefix because it does not change what kind of object the string is, only how the characters between the quotes are interpreted. In Python 2, both ur"maçã \b" and br"..." are accepted. In Python 3, "b" combines with "r" (br"..." or rb"..."), but ur"..." is a syntax error.
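A quick way to check the combinations on a Python 3 interpreter is sketched below. The helper name compiles is hypothetical, made up for this example; it just asks the built-in compile() whether a literal is accepted.

```python
raw_bytes = rb"ma\xc3\n"   # raw + bytes: the escapes are NOT processed
plain_bytes = b"ma\xc3\n"  # bytes only: \xc3 and \n each become one byte

def compiles(source):
    """Return True if `source` is accepted as a Python expression."""
    try:
        compile(source, "<demo>", "eval")
        return True
    except SyntaxError:
        return False

br_ok = compiles('br"abc"')  # bytes + raw
ur_ok = compiles('ur"abc"')  # unicode + raw, a Python 2 spelling

print(len(raw_bytes), len(plain_bytes), br_ok, ur_ok)
```

On Python 3 the ur form is rejected by the compiler, while br and rb are interchangeable.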

  • Strings with f": new in Python 3.6, they do not exist in any previous version. The literal is parsed when the program is compiled and the expressions inside it are evaluated at run time, producing an ordinary string. It lets you insert variable values, arithmetic, or any Python expression directly inside the string, without calling the format method.

Example:

nome = "João"
print(f"My name is {nome}")  # My name is João
  • Strings with f" can also be combined with the "r" prefix: fr"...".
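A few more f" cases, as a sketch for Python 3.6 or later. The variable idade and the strings are invented for the example.

```python
nome = "João"
idade = 30

greeting = f"{nome} will be {idade + 1} next year"  # any expression is allowed
padded = f"{idade:>5}"                              # format specs work as in str.format
raw_f = fr"{nome}\notes"                            # braces are filled in, "\n" stays literal

print(greeting)
print(repr(padded), repr(raw_f), type(greeting).__name__)
```

The result is always a plain str; nothing about the value remembers that it came from an f" literal.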

These are the types that exist as prefixes to the quotation marks. To recap, in Python 3: b" is an object of type bytes; u" is the same as no prefix and only exists to make it easy to write code compatible with both Python 2 and 3; r" is for "raw" strings, where a "\" is always a "\"; f" strings can include other objects and expressions directly; and r can be combined with b or f (the ur combination exists only in Python 2).
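Since the question also asks about the impact on type and size, here is a rough Python 3 summary, using the same two-character source text a\n under each prefix (the dictionary keys are just labels I chose):

```python
samples = {
    "plain": "a\n",
    "u": u"a\n",
    "r": r"a\n",
    "b": b"a\n",
    "rb": rb"a\n",
    "f": f"a\n",
}

# Map each label to (type name, length) of the resulting object.
summary = {label: (type(value).__name__, len(value)) for label, value in samples.items()}

for label, (type_name, size) in summary.items():
    print(f"{label:>5}: {type_name:<5} len={size}")
```

Only b changes the type and only r changes the length here, because r keeps the backslash and the n as two separate characters.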

2

In Python 2.7, you have the following options:

  • s = 'dog': ASCII string
  • s = u'dog': Unicode string, which accepts special characters and accents (á, º, §, etc.)
  • s = r'cao\n\t\r': raw string, which takes the characters exactly as they are written.

Possible examples:

>>> s = 'cão' # error, because 'ã' is not ASCII
>>> s = u'cão' # OK, because 'ã' is Unicode

>>> s = 'cachorro\n' # interprets the \n as a line break
>>> print s
cachorro

>>> s = r'cachorro\n' # interprets the \n as '\n'
>>> print s
cachorro\n

>>> s = u'cão\n' # interprets the \n as a line break
>>> print s
cão

>>> s = ur'cão\n' # interprets the \n as '\n'
>>> print s
cão\n
  • What about Python 3? What about their combinations? What is their impact on the string?
