Python modules – how do they work?
1. Introduction

As our Python programs grow larger, the number of names for variables, functions, classes, and so on grows to the point where it becomes necessary to organise them into categories or subsets, commonly called namespaces. Python offers several language structures for this purpose, among them classes, modules, and packages.
In this article we will not talk about classes, as they deserve a separate discussion. Instead, we will describe the functionality offered by Python modules and packages, paying special attention to the module importing mechanism. I hope that this will help Python programmers to better understand the usage of such useful packages as wxPython or NumPy.

This article was inspired by questions I was asked by the users of the shipbuilding system Tribon M3, made by AVEVA Solutions Ltd, where the embedded Python 2.3 interpreter with more than 600 specialised functions is used as the customisation tool. The functionality discussed here is, however, a feature of the Python language in general, valid not only for Python 2.3 but also for the newer releases (2.4 and 2.5).

2 Modules

2.1 What are they?

In short, modules are just collections of definitions of variables, functions, and classes. They can also contain directly executable code, but we will discuss this feature later. In general, we can distinguish three types of modules:
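Built-in modules. They are always accessible, as they are built into the Python interpreter in use. Examples: sys, math, time. The names of the built-in modules are kept in the tuple sys.builtin_module_names. The following code prints out the names of all built-in modules available in the running implementation of the Python interpreter:

    import sys
    # Built-in modules live inside the interpreter, but they still have to be
    # brought in with the import statement.
    # No source file for a built-in module can be found in the Python library folders.
    print(sys.builtin_module_names)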
Modules written in Python. Here we need to have either the source code of the module (written in the Python language), or at least a special binary version (file extension .pyc) containing the precompiled bytecode of the module. To be available to the program, this file needs to be located in one of the folders defined in the list sys.path. Examples: os, re, csv, types. (For these standard modules, the corresponding .py file can be found in the Python library folders.)
Extension modules written in a compiled language. Using a fully compiled programming language improves performance over a Python-only solution, and allows even very low-level API calls to be used. In order to make the DLL available as a module to the Python interpreter, the code must follow certain conventions described in detail in the manual 'Extending and Embedding the Python Interpreter' written by Guido van Rossum, the author of the Python language. By convention, the compiled DLL file has the extension .pyd, not .dll, and should be located in one of the folders defined in the list sys.path, like any other standard Python module. Examples: _tkinter, _sre.

2.2 How are they used?

In order to use a definition from any module, the module must be imported first. This is done through the import statement, which can use any of the following syntax variants:

· import moduleName
· import moduleName as alternateName
· from moduleName import *
· from moduleName import id1, id2, …

The general workflow of the module import is presented below:

1. First, the interpreter searches for the given module name in the dictionary sys.modules, which contains the names of the modules that have been loaded and initialised so far. If the module is found, the system proceeds directly to step 3, skipping step 2.

2. If the module has not been loaded yet, the system first searches the list of built-in modules and, if unsuccessful, continues by searching for a matching file in the folders listed in sys.path. If the module is found, it is loaded and initialised. Finally, if there was no error during loading or initialisation, it is registered in the sys.modules dictionary. If the module is not found, the ImportError exception is raised. Note that if a module has been found in the sys.modules dictionary, any subsequent import statements trying to import the same module WILL NOT initialise the module again!

3. At this stage, the activities performed by the Python interpreter depend on the syntax variant of the import statement that has been used.

A.
import moduleName

The name of the module is bound in the local namespace. All identifiers defined in the module become available to the program through names qualified with the module's name as a prefix (see section 2.7). Example:

    import myModule

B. import moduleName as alternateName

The alternate name of the module is bound in the local namespace. All identifiers defined in the module become available to the program through names qualified with the module's alternate name as a prefix (see section 2.7). Example:

    import sys
    # Define the module 'utils' by importing either
    if sys.platform == 'win32':

C. from moduleName import *

Here, the name of the module is not bound in the local namespace. Instead, the system searches the namespace of the loaded module and binds in the local namespace the names of all PUBLIC identifiers in the module. How does Python determine which identifiers in a module are PUBLIC and which are not? You will find the answer to this question in section 2.3.

D. from moduleName import id1, id2, …

Here, the name of the module is also not bound in the local namespace, but the system searches the namespace of the loaded module and binds in the local namespace the names of the LISTED identifiers (or their alternate names). Of course, if some of the listed names are not found, an ImportError exception is raised.

Note that when using syntax C or D, there is a danger of rebinding some existing names to the imported items, effectively losing access to the previously bound items. Example:

    doIt = True
    from myModule import *
    print(doIt)

The above code simply assumes that the identifier doIt still refers to the Boolean variable defined in the first line. We are in trouble, however, if the module myModule defines a public identifier with the same name. Then our Boolean variable is rebound (overwritten) with the definition of the identifier imported from the module myModule.
When we later try to print out the original variable, we find that doIt is no longer a Boolean variable but, for example, a function or a string, and we have effectively lost access to the original variable (it might even have been garbage-collected!). Of course, this would not happen if we used a different syntax of the import statement:

    doIt = True
    import myModule
    print(doIt)

Here, the identifier doIt still refers to the Boolean variable defined in the first line: doIt and myModule.doIt are simply two different objects! Of course, a good naming convention for identifiers in our modules would also minimise the risk of such a name conflict.

2.3 Public identifiers

In the previous section we mentioned that the statement:

    from myModule import *

imports all public identifiers from the given module. How does Python recognise which names are public and which are not?

First, Python checks whether the imported module defines the global variable __all__. If it does, this variable should be a sequence of strings naming the identifiers that are considered public. This feature of the language makes it possible to prevent certain names from being imported, especially those that are private to the module. Imagine a large module containing many functions, but only one main function that is called from other modules. The other functions are just auxiliary functions called internally within the module. This is an ideal situation to use the __all__ variable! Example:

The module: myModule.py

    def fun1():    # Private function
        pass

    def main():    # Main (public) function
        pass

    __all__ = ['main']

Now, we can try to execute the following commands:

    from myModule import *

No problem here … Let's try something else:

    res = fun1()

Oops! The NameError exception is raised! What has happened? Even though the functions fun1 and main both exist in the imported module, only one of them is really accessible.
The exception is raised because the name 'fun1' does not appear in the __all__ list, causing this function to be ignored by the import statement. If the __all__ variable does not exist in the module, all names in the module's namespace that do not start with an underscore ('_') are considered public.

2.4 Module initialisation

If Python needs to initialise a module, it does so either by running a special module initialisation function (for modules not written in Python), or by executing the module's body (for modules written in Python). In the latter case, all the module's definitions are processed and the directly executable code is executed.

As already said in section 2.2, a module is initialised only once, when it is not found in the sys.modules dictionary. This has an important implication for the behaviour of a program importing a module that contains directly executable code. Since this code is executed only during the module initialisation, it is executed only ONCE, no matter how many times we import the given module in our program! Therefore, it is bad design practice to put directly executable code in a module that is imported several times in the program. We simply cannot assume that this code will be executed on every import of the module! Instead, we should put this code inside a function (e.g. run()), and use the following pattern:

    import myModule
    myModule.run()

calling the run() function explicitly.

In certain situations, however, we need to write a module that is sometimes imported and sometimes run directly as the main program. This is especially useful during the development of the module, as we can run the module to execute some unit tests.

    if __name__ == '__main__':
        runTest()

The special string variable __name__ is assigned the name of the module if it is imported, and the value '__main__' if the module is run. Therefore, the above if statement will execute the runTest() function only if the module is run, and not when it is imported.
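The run-once behaviour described in this section can be observed with a small experiment. The sketch below is only an illustration, written so that it runs on a current Python 3 interpreter (the article itself targets Python 2.3 to 2.5). The module name demo_once, its run() function and the log file init.log are hypothetical names invented for this example.

```python
import importlib
import os
import sys
import tempfile

# Source of a hypothetical module. Its body appends one line to a log file
# every time the body is executed, i.e. every time the module is initialised.
MODULE_SOURCE = '''
import os

_log = os.path.join(os.path.dirname(os.path.abspath(__file__)), "init.log")
with open(_log, "a") as handle:
    handle.write("module body executed\\n")

def run():
    return "run() called"

if __name__ == "__main__":
    # Reached only when the file is run directly, not when it is imported.
    print(run())
'''

folder = tempfile.mkdtemp()
with open(os.path.join(folder, "demo_once.py"), "w") as handle:
    handle.write(MODULE_SOURCE)

sys.path.insert(0, folder)      # make the temporary folder importable
importlib.invalidate_caches()   # let the import system notice the new file

import demo_once                # first import: the module body is executed
import demo_once                # second import: found in sys.modules, body skipped

with open(os.path.join(folder, "init.log")) as handle:
    log_lines = handle.read().splitlines()

print(log_lines)                # a single entry, despite two import statements
print(demo_once.run())          # the explicit call works whenever it is needed
print(demo_once.__name__)       # the module name, so the __main__ branch was skipped
```

Writing to a log file is just one convenient way to make the initialisation visible; a print statement in the module body would show the same effect on the console.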
In order to overcome the limitation of the one-time initialisation of a module, the Python language offers the reload() function.

    reload(module_name)

The argument is the module object itself, so the module must already have been imported. The call causes the module to be reinitialised, and returns the module object as a result.

2.5 .py, .pyc, and .pyo files

If a module is written in Python, its source code is stored in a file with the '.py' extension (e.g. 'myModule.py' for the module 'myModule'). Whenever this file is successfully compiled, an attempt is made to write the compiled version to a file with the same base name and the '.pyc' extension (e.g. 'myModule.pyc'). It is not an error if this attempt fails; if for any reason the file is not written completely, the resulting '.pyc' file will be recognised as invalid and thus ignored later. The contents of the '.pyc' files are platform independent, so a Python module directory can be shared by machines of different architectures.

If the Python interpreter is invoked with the -O flag (or -OO), Python performs some optimisation on the compiled source code, and instead of the '.pyc' file it creates a file with the extension '.pyo'. It is used in the same way as the '.pyc' file; the only difference is that '.pyc' files are used when no optimisation is requested, whereas '.pyo' files are used when the Python interpreter works in the optimising mode.

The compiled version ('.pyc' or '.pyo') has the modification date of the corresponding '.py' file stored within the file. Python loads the compiled version of the module (without recompiling the '.py' file):
Otherwise, Python ignores the '.pyc' file and parses the source file. This automates the development process, as you don't have to worry about explicitly updating your compiled files: Python will do it for you … usually.

When developing a Python program, you may sometimes notice that Python 'does not see' the changes you have made to your module unless you quit and restart your development environment. This can be explained as follows:
Fortunately, most development environments provide facilities to request the reloading of an already imported module (some kind of 'Reload Modules' button). If not, you can always put a temporary reload(moduleName) statement in your program, forcing the immediate reloading of the module after import.

2.6 sys.path variable

The variable sys.path defines the list of folders searched for imported module files. It is built during the interpreter's initialisation, and can also be customised at run-time. It is important to know how this list is built, and how we can add our own path to the list to let our scripts find their modules.

First of all, this variable is initialised from the environment variable PYTHONPATH. Both the Python interpreter itself and various other software systems using the Python interpreter (e.g. Tribon M3) can define or update this variable accordingly, setting it to a list of paths separated by semicolons (on Windows).

Then, the special site module is imported. Note that there is no need to issue the statement import site; the Python interpreter will do this for you. This module is a standard Python module, which can be customised by the user to add specific changes to the environment, e.g. adding new folders to the sys.path list. By default (if you install a standalone Python interpreter), it adds some standard folders, like '/Python23/lib/site-packages'.

Additionally, the site module searches the site-specific folders it has added (such as site-packages) for *.pth files (path configuration files). If found, they are all read, and the paths defined therein are automatically added to the sys.path list, extending it. Such files are used by some Python packages, e.g. wxPython, to define the location of the wxPython package modules.

Further system customisation can be placed in an optional sitecustomize module, which the site module attempts to import. Then we can leave the site module unchanged, and put all the customisation in the sitecustomize module.
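To see what the interpreter has actually assembled from these sources, the search path can simply be printed. The following is a minimal sketch, written for a current Python 3 interpreter; the helper name describe_search_path is made up for this illustration and is not part of any library.

```python
import os
import sys

def describe_search_path():
    """Return (index, folder, is_folder) for every entry of sys.path."""
    rows = []
    for index, folder in enumerate(sys.path):
        # An empty string in sys.path stands for the current directory.
        shown = folder or os.getcwd()
        rows.append((index, shown, os.path.isdir(shown)))
    return rows

rows = describe_search_path()
for index, folder, is_folder in rows:
    note = "folder" if is_folder else "not a folder (missing, or an archive)"
    print(index, folder, note)

# The part contributed by the PYTHONPATH environment variable, if it is set.
# os.pathsep is ';' on Windows and ':' on Unix-like systems.
raw = os.environ.get("PYTHONPATH", "")
print("PYTHONPATH entries:", [p for p in raw.split(os.pathsep) if p])
```

The order of the printed entries is the order in which the folders are searched, so the index column shows directly which folder wins when a module name exists in more than one place.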
Finally, the module search path (sys.path) can be customised at run-time. Example:

    import sys
    path = 'E://PRIVATE//MODULES'
    sys.path.append(path)
    import my_test_module

where the file my_test_module.py is located in the folder E:/PRIVATE/MODULES.

In the above example, the user-defined folder is placed at the end of the sys.path list. If you prefer to place it at the beginning of this list, just replace the statement:

    sys.path.append(path)

by

    sys.path.insert(0, path)

Why is this important? Let's imagine that the module my_test_module is located not only in the folder E:/PRIVATE/MODULES, but also in some of the other folders listed in the sys.path variable. It might be the same module (a copy), another (maybe older!) version of the same module, or even a completely unrelated module that has the same name as our module only by coincidence. Whatever the reason for the existence of this duplicate, the rule is simple: the first matching file found is selected when searching the folders of the sys.path list in the order defined by this list. So, without a warning, you might import a different module than the one you wanted to import … Of course, a good naming convention for modules can minimise the risk of such ambiguities.

2.7 Accessing the imported data

As discussed in section 2.2, access to the identifiers imported from a module depends on how they were imported. The performance of your program also depends on the namespace where the imported identifier is bound. Let's analyse the following two import statements:

1. import math -> math.sin(…)
2. from math import sin -> sin(…)

In the first case, the module name itself is registered in the current namespace, but the name 'sin' is registered in the module's namespace. Therefore, we need to use the qualified name math.sin here. In the second case, the identifier sin itself is registered in the current namespace, which allows this name to be used directly. You cannot use the qualified name math.sin here, because the module's name (math) does not exist in the current namespace.
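The difference between the two import variants can be made visible by running each of them in its own, initially empty namespace. This is only a sketch for a current Python 3 interpreter; the dictionary names ns_plain and ns_from are arbitrary names chosen for the illustration.

```python
import sys

# Variant 1: "import math" binds only the module name.
ns_plain = {}
exec("import math", ns_plain)

# Variant 2: "from math import sin" binds only the function name.
ns_from = {}
exec("from math import sin", ns_from)

print("math" in ns_plain, "sin" in ns_plain)   # True False
print("math" in ns_from, "sin" in ns_from)     # False True

# In both cases the module itself is loaded and registered in sys.modules.
print("math" in sys.modules)                   # True

# Both routes lead to the very same function object.
print(ns_plain["math"].sin is ns_from["sin"])  # True
```

Note that the module is fully loaded in both cases; the two variants differ only in which name ends up in the importing namespace.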
Summing up, the from version of the import statement allows shorter identifiers to be written, possibly also improving performance, but it also clutters the current namespace with many new identifiers. This negative effect is especially pronounced for the statement:

    from moduleName import *

as it can import a great many names into the current namespace.

2.8 Namespaces and the name resolution

In order to understand how Python resolves names (qualified or not), translating them into the memory addresses of variables, functions, etc., we need to be aware of the namespaces available for searching. In general, there are three namespaces, which are searched in the sequence given below:

1. the local namespace
2. the global (module) namespace
3. the built-in namespace
The local namespace contains all names available in the current scope (e.g. local variables in a function). The global, or module's, namespace contains all names available in the current module (the main program is also considered to be a module here). Here you will find all 'global' variables, functions, classes, etc. The last one, the built-in namespace, contains the names defined in the module __builtin__, which is always accessible. It contains names such as:
Understanding the order in which the namespaces are searched when resolving a name may help to write more efficient programs. Example:

    def codes(name):
        codeList = []
        for ch in name:
            codeList.append(ord(ch))
        return codeList

The above function produces a list of ASCII codes of the characters in the string name. If this is a really long string, it may be worthwhile to optimise the for loop inside this function. Let's consider the names that are resolved within the loop:
codeList.append is a qualified name, which requires two lookups. First, Python needs to find codeList, which fortunately happens to be defined in the local namespace. Then it finds out that it is a list, and searches for the identifier append in the list object's definition. This two-level search has to be done on each iteration of the for loop. We can reduce the overhead by defining a local alias for the bound method codeList.append, and using it inside the loop.

The next candidate for optimisation is the ord function. It is a built-in function, so Python will find it only after fruitlessly searching the local and global namespaces. By defining a local alias for this function, we let Python find it during the first pass, in the local namespace. Summing up, the optimised code looks as follows:

    def codes(name):
        codeList = []
        append = codeList.append    # local alias for the bound method
        locOrd = ord                # local alias for the built-in function
        for ch in name:
            append(locOrd(ch))
        return codeList

Of course, it is not yet the fastest version of our function. There is still some room for optimisation, but it goes beyond our topic of namespaces and name resolution. Therefore, I leave finding better code as an exercise for the reader.

Once we have understood the basic principles of name resolution, it becomes clear how Python deals with qualified names. For example, the identifier csv.DictWriter.writerow refers to the method writerow(…) of the class DictWriter, defined in the module csv. Let's analyse the following code:

    def fun():
        import csv

When executing the function fun(), after successfully importing the module csv, Python places the name 'csv' in the local namespace of the function fun(). When executing the next statement, Python performs the following activities:
Finally, we must understand how Python handles assignments, binding values to old or new identifiers. By default, Python binds the value to a variable in the local namespace, possibly hiding other objects with the same name that exist in other namespaces. Example:

    abs = 5

After executing the above assignment, we have simply lost access to the built-in function abs(). As long as the newly created variable abs lives in the current scope, the built-in function abs() is hidden. We can recover from this situation by deleting the variable abs. This will unhide the function abs().

    del abs

The alternative is to define an alias BEFORE hiding the function:

    locAbs = abs

Of course, the best solution is to avoid such ambiguities altogether.

It is also possible to request that the assignment should go to the global namespace, and not to the local one. Example:

    lastX = None

    def fun(x):
        global lastX
        lastX = x

In the above code, the assignment lastX = x does not create a local variable lastX, but instead updates the global variable, thanks to the global declaration.

2.9 The __import__ function

The import statement, as discussed so far, offers static import facilities: the name of the imported module is hardcoded in the source code of your program. The Python language also offers a function that makes it possible to import modules dynamically, when the name of the imported module is not known in advance. Example:

    def listGlobals(modName):

The above function returns a list of the global identifiers defined in the module whose name is passed as the argument. The __import__ function is invoked internally by the import statement. It imports the given module and returns the module object, which can then be accessed in the same way as through the module name. For example, we could write:

    res = module.run()

using the module variable obtained as a result of the __import__ function call.
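A dynamic import along these lines could be sketched as follows. This is not the original listGlobals function, whose body is not reproduced here; list_globals is a hypothetical stand-in written for a current Python 3 interpreter, and it assumes that 'global identifiers' means the public names found in the module's namespace.

```python
def list_globals(mod_name):
    """Return the public names defined in the module called mod_name."""
    module = __import__(mod_name)   # the module name is only known at run time
    return sorted(name for name in vars(module) if not name.startswith("_"))

names = list_globals("math")
print("sin" in names)

# For a dotted name, __import__ returns the TOP-LEVEL package ...
top = __import__("xml.dom")
print(top.__name__)

# ... unless a non-empty list of names is passed as the fourth argument,
# in which case the module named by the dotted path itself is returned.
sub = __import__("xml.dom", globals(), locals(), ["minidom"])
print(sub.__name__)
```

The behaviour for dotted names is easy to overlook: without the fourth argument, the returned object is the outermost package, and the submodule has to be reached through attribute access (here top.dom).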
The __import__ function supports additional, optional arguments:

    module = __import__(modName, globalDict, localDict, fromList)

The globalDict dictionary contains the global identifiers (you may use the globals() function here), and localDict contains the local identifiers (available through the locals() function). The fromList argument is used to simulate the from modName import id1, id2, … syntax; it contains the list of names to import.

Sometimes the built-in __import__ function is replaced by a custom implementation with a compatible interface, to support some special way of importing modules. The imp module is useful if you need to write your own __import__ function.

2.10 Nested functions

This topic does not concern modules directly, but it is strongly related to our recent discussion of namespaces and name resolution. The Python language allows functions to be defined inside other functions. Example:

    def fun(x):
        def step2(a):
        step1(x)

Here, the argument x is passed to the internal functions in the usual way. Please note, however, that the internal functions also have access to the variables defined in the outer scope (status), in the main function fun(). This allows fewer variables to be passed as arguments to the internal functions. This is, however, not 100% foolproof. You may safely read such variables, but if you attempt to modify such a variable inside a nested function, you will instead create a local version of it, valid inside this nested function only and hiding the outer variable.

Summing up, nested functions can help you hide some private functions. This is sometimes useful, but only for short, simple functions that do not modify the local variables of the outer scope. There can be a temptation to use this approach to reduce the number of externally 'visible' functions in a module, but this can lead to errors that are difficult to find, and decrease the readability of the source code. Therefore, I recommend using this feature sparingly.
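The read-versus-assign pitfall described above can be reproduced in a few lines. The sketch below is an illustration only; the inner function names read_only and try_to_modify are invented for this example and do not come from the original listing.

```python
def fun(x):
    status = "initial"            # variable of the outer function

    def read_only():
        # Reading names from the enclosing scope works as expected.
        return status + "/" + str(x)

    def try_to_modify():
        # This assignment creates a NEW local variable called status;
        # the variable of the outer function is left untouched.
        status = "changed"
        return status

    seen = read_only()
    inner = try_to_modify()
    return seen, inner, status

print(fun(1))   # ('initial/1', 'changed', 'initial')
```

The last element of the result shows that the outer variable kept its value. In Python 3 a nonlocal declaration can be used to rebind the outer variable, but that statement does not exist in the Python 2 releases this article discusses.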
2.11 Is my module available?

Sometimes we write code assuming that a given module will be available on the target system. If it is possible that this module may be missing, we may want to write some alternative for this case. This requires the ability to detect whether the given module can be imported or not:

    try:
        import myModule
        myModuleOK = True
    except ImportError:
        myModuleOK = False

The Boolean variable myModuleOK tells whether the module myModule could be imported or not. Additionally, you may want to import an alternative module instead, offering similar functionality, but coded using some alternative methods or resources. Thanks to the as clause, the rest of the code can just assume that the module myModule is present; it does not need to know that a replacement has been provided instead of the original version. Note that we should only react to the ImportError exception. Other exceptions indicate an error in the code rather than the inability to import the module.

3 Packages

In general, packages are hierarchical structures of modules. Therefore, most of what we have said about modules also applies to packages. In this section we will focus on the features specific to packages. When using packages, module names are composed of several names separated by dots, e.g. win32com.client (from pywin32, the Python for Windows extensions package).

3.1 Package folder structure

We will analyse the package folder structure using the pywin32 package as an example. After you install it, you will find in the folder /Python23/Lib/site-packages the file pywin32.pth, which extends the sys.path list by adding the following folders (see section 2.6):
These are standard Windows folders with additional modules supplied by the package. But that's not all! In the site-packages folder we can find some more folders coming from the pywin32 package, but not listed in the pywin32.pth file. One of them is the /Python23/Lib/site-packages/win32com folder. What's special about this folder? You will find that it contains the file __init__.py, commented at the top as 'Initialization for the win32com package'. This folder contains a few other Python source files (e.g. 'util.py'), but also a few subfolders. One of them is 'client'. Interestingly, it also contains an __init__.py file. It turns out that Python's import statement treats subfolders containing an __init__.py file as modules, so that the following statements work fine:
This approach is then applied recursively if the imported element is also a subfolder with an __init__.py file:
The __init__.py file must exist (it may be empty!) in the subfolder for it to be considered a subpackage. If you look closely, you will find that the function Dispatch is defined in the file __init__.py in the subfolder 'client', but apart from this file, the 'client' subfolder also contains some other Python source files. They are submodules of the win32com.client package. Summing up, in order to use the function win32com.client.Dispatch(), we should first execute one of the following import statements:
As you can see from the above examples, the statement from package import item is able to import either a subpackage, a submodule, or some other name defined in the package.

3.2 from package import *

The Python interpreter cannot find by itself the submodules of a given package or subpackage. We need to help it by providing the __all__ variable (see also section 2.3), a sequence defining the names of the available public submodules. We should define this variable in the __init__.py file of the particular package. Example:

Let's assume that we add the following definition to the file __init__.py in the 'client' subfolder:

    __all__ = ['build', 'util']

Then, the statement

    from win32com.client import *

would import into the current namespace the following submodules only:
even though the 'client' folder contains some more modules. If the __all__ definition is missing, the import statement does NOT import all the submodules of the package. It only ensures that the package has been successfully loaded, and then imports into the current namespace only the following identifiers:
3.3 Intra-package references

Submodules often need to reference other submodules of the package. If the other submodule is defined within the same subpackage, we can use a simple non-qualified reference. Example:

If we had a submodule named 'test' defined in the win32com.client subpackage, it might import the util submodule (also from the win32com.client subpackage) using the simple statement:

    import util

without having to use the qualified name:

    import win32com.client.util

It works because, for packages, the module search sequence is modified to include the current subpackage as the first place to search.

If a reference is made to a submodule of another subpackage, the fully qualified name must be used. Example:

Here we are still considering the fictitious submodule 'test' in the win32com.client subpackage. We would like to import the submodule 'policy' from the package win32com.server. The following statement is required:

    import win32com.server.policy

3.4 A few comments about wxPython

One of the common questions asked about wxPython concerns its dual naming convention. You can use either style, although the old one is being deprecated, and its use is discouraged:

Old style (first variant):

    from wxPython import wx

Old style (second variant):

    from wxPython.wx import *

The first variant required typing the 'wx' prefix twice, which was quite annoying. The second variant had quite a high chance of causing name conflicts, and therefore the 'wx' prefix was imposed on all names in the wx module. This style should not be used anymore. The new style looks as shown below:

    import wx

Here you import not the wxPython module but the wx module, and the identifiers in the 'new' wx module have lost their 'wx' prefix, which improves the readability of the source code. How does it work? In the Python23/Lib/site-packages/wx folder we can find the __init__.py file, which is evidence that wx is a Python package.
Further investigation of this file reveals how the wx package renames the identifiers found in the original wxPython package. It defines the _rename() function, which translates the names from one dictionary into another dictionary, dropping the 'wx' prefix according to certain rules. It also calls this function to rename the identifiers found in the namespace of the wxPython.wx module itself, storing the new identifiers in the dictionary returned by the globals() function, which naturally exposes the new names to the programs importing the new wx module. This is a very good example of low-level module manipulation, which also shows how flexible the Python language is, enabling such a translation to be done in a way practically invisible to the programmer.

When using the new style discussed above, we should remember to use the statement

    import wx

as the first import from the wxPython package. Afterwards, you can do other imports, e.g.:

    from wx import html

If you don't perform import wx first, the other imports will not have the identifiers properly renamed, which will cause trouble.

4 Final notes

I really hope that this article will help Python developers to use the full potential of this language. I also take full responsibility for any errors or omissions. In fact, I would be grateful for any feedback that would help me make this article better.

I would like to thank the members of the MBM Project Tribon forum and the participants of the Tribon Vitesse training courses that I have delivered during the past few years for the many discussions and questions.