22

In the context of a complex application, I need to import user-supplied 'scripts'. Ideally, a script would have

    def init():
        blah

    def execute():
        more blah

    def cleanup():
        yadda

so I'd just

    import imp
    fname, path, desc = imp.find_module(userscript)
    foo = imp.load_module(userscript, fname, path, desc)
    foo.init()

However, as we all know, the user's script is executed as soon as load_module runs. Which means, a script can be something like this:

    def init():
        blah

    yadda

which results in the yadda part being called as soon as I import the script.

What I need is a way to:

  1. check first whether it has init(), execute() and cleanup()
  2. if they exist, all is well
  3. if they don't exist, complain
  4. don't run any other code, or at least not until I know there's no init()

Normally I'd force the use of the same old if __name__ == '__main__' trick, but I have little control over the user-supplied script, so I'm looking for a relatively painless solution. I have seen all sorts of complicated tricks, including parsing the script, but nothing really simple. I'm surprised it does not exist... or maybe I'm not getting something.

Thanks.

6
  • I'd just tell your users to use if __name__ etc. If they don't do it, that's just their lookout. Commented Dec 18, 2011 at 13:56
  • @DavidHeffernan yes, that would be the usual way. But I'm surprised there's no actual way to do this without having to talk to a human being Commented Dec 18, 2011 at 13:57
  • 1
    Due to Python's nature, you can't tell what's in there until it's executed. The closest you can get is using find_module and then inspecting it manually... which will still be flawed, as it can pull in code from other places or use a strange encoding like rot13 or other such fun stuff. Commented Dec 18, 2011 at 13:58
  • Precisely. Python modules can execute arbitrary code (and many do, to perfectly legitimate ends: right now, I'm working on a few modules that call collections.namedtuple, which builds a string and execs it). Just like you generally can't check types or check whether it does not use certain functions, you cannot determine what a module does without importing it. You'll have to trust the user to some degree, or execute the code in a full-blown sandbox (likely not what you want, and likely overkill). Commented Dec 18, 2011 at 14:15
  • 1
    @robus not at all, the module is still executed no matter how you import it. Different imports just influence name binding. Commented Dec 18, 2011 at 14:53

7 Answers

10

My attempt using the ast module:

    import ast

    # which syntax elements are allowed at module level?
    whitelist = [
        # docstring (ast.Str was removed in Python 3.12, so test for a
        # string-valued ast.Constant instead)
        lambda x: isinstance(x, ast.Expr)
                  and isinstance(x.value, ast.Constant)
                  and isinstance(x.value.value, str),
        # import
        lambda x: isinstance(x, ast.Import),
        # class
        lambda x: isinstance(x, ast.ClassDef),
        # function
        lambda x: isinstance(x, ast.FunctionDef),
    ]

    def validate(source, required_functions):
        tree = ast.parse(source)
        functions = set()
        required_functions = set(required_functions)
        for item in tree.body:
            if isinstance(item, ast.FunctionDef):
                functions.add(item.name)
                continue
            if all(not checker(item) for checker in whitelist):
                return False
        # at least the required functions must be there
        return len(required_functions - functions) == 0

    if __name__ == "__main__":
        required_funcs = ["init", "execute", "cleanup"]
        with open("/tmp/test.py", "rb") as f:
            print("yay!" if validate(f.read(), required_funcs) else "d'oh!")

7 Comments

I like this, but I'm not sure if this will work all the time (meaning: I am certain that someone will work out how to circumvent this ;) )
Neat, this is a great use of AST. As an academic exercise, how can we circumvent this (so it validates a file which runs arbitrary code on import)? So far, I see two ways: function calls in default arguments (def f(x=dothings()):) and writing a metaclass to run code when a class is defined.
@Thomas K: We could fix that by forbidding classes and functions with parameters. But then again, there are other possibilities like importing custom modules etc... I don't think you can really get much better.
Very pretty. Actually, I am not worried at all about mis-use: if a user can circumvent this, they can just as well write the initial script according to my requests (with exec(), init(), etc.) and save us the hassle. I was just worried about users inserting code not in those functions in the first place, then complaining. But the more I work with python, the more I appreciate its "made for programmers" philosophy. Thanks!
@Niklas: It's definitely a good solution here. I'm just interested in how secure this approach could be made. I think it's possible to do quite well, without many false negatives.
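The default-argument circumvention mentioned in these comments can be shown with a short sketch. `SNEAKY`, `SIDE_EFFECTS` and `only_defs` are hypothetical names made up for illustration, and `only_defs` is a simplified stand-in for the whitelist, not the answer's exact code.

```python
import ast

# Hypothetical user script: nothing but a function definition at module level,
# yet the default argument is evaluated as soon as the `def` statement runs.
SNEAKY = '''
def init(x=SIDE_EFFECTS.append("ran at import time")):
    pass
'''

def only_defs(source):
    """Simplified stand-in for the whitelist: accept only top-level defs."""
    return all(isinstance(node, ast.FunctionDef)
               for node in ast.parse(source).body)

SIDE_EFFECTS = []
namespace = {"SIDE_EFFECTS": SIDE_EFFECTS}

accepted = only_defs(SNEAKY)                          # the static check passes
exec(compile(SNEAKY, "<sneaky>", "exec"), namespace)  # simulates the import

print(accepted, SIDE_EFFECTS)  # prints: True ['ran at import time']
```

Closing this particular hole would mean also inspecting `node.args.defaults` (and decorators) of every accepted function, which is where the whitelist starts to grow.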
6

Here's a simpler (and more naive) alternative to the AST approach:

    import sys
    from imp import find_module, new_module, PY_SOURCE

    EXPECTED = ("init", "execute", "cleanup")

    def import_script(name):
        fileobj, path, description = find_module(name)

        if description[2] != PY_SOURCE:
            raise ImportError("no source file found")

        code = compile(fileobj.read(), path, "exec")

        expected = list(EXPECTED)
        for const in code.co_consts:
            if isinstance(const, type(code)) and const.co_name in expected:
                expected.remove(const.co_name)
        if expected:
            raise ImportError("missing expected function: {}".format(expected))

        module = new_module(name)
        exec(code, module.__dict__)
        sys.modules[name] = module
        return module

Keep in mind, this is a very direct way of doing it and circumvents extensions to Python's import machinery.
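Note that the `imp` module was deprecated in Python 3.4 and removed in Python 3.12. A minimal sketch of the same idea using `types.ModuleType`, assuming the script source is already available as a string; `load_checked` and the sample source are hypothetical names, not part of the answer.

```python
import sys
import types

EXPECTED = ("init", "execute", "cleanup")

def load_checked(name, source, filename="<userscript>"):
    """Compile first, check for the expected defs, and only then execute."""
    code = compile(source, filename, "exec")
    # Functions defined at module level appear as code objects in co_consts.
    defined = {c.co_name for c in code.co_consts
               if isinstance(c, types.CodeType)}
    missing = [n for n in EXPECTED if n not in defined]
    if missing:
        raise ImportError("missing expected function: {}".format(missing))
    module = types.ModuleType(name)
    exec(code, module.__dict__)  # any module-level code still runs here
    sys.modules[name] = module
    return module

good = "def init(): return 'ok'\ndef execute(): pass\ndef cleanup(): pass\n"
mod = load_checked("userscript_demo", good)
print(mod.init())  # prints: ok
```

Like the original, this check is naive: a class or a conditionally defined function named `init` would also satisfy it.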

Comments

4

I'd first of all not require some functions, but a class that conforms to a specified interface, using either the abc module, or zope.interface. This forces the maker of the module to supply the functions you want.

Secondly, I would not bother looking for module-level code. It's the module-maker's problem if he does this. It's too much work with no actual benefit.

If you are worried about security issues, you need to sandbox the code somehow anyway.
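As a minimal sketch of the interface idea with the standard `abc` module: the class names `Script` and `UserScript` are hypothetical, and the method names are taken from the question.

```python
from abc import ABC, abstractmethod

class Script(ABC):
    """Interface the application would hand out to script authors."""

    @abstractmethod
    def init(self): ...

    @abstractmethod
    def execute(self): ...

    @abstractmethod
    def cleanup(self): ...

class UserScript(Script):
    # Hypothetical incomplete user script: cleanup() is missing.
    def init(self):
        return "init"

    def execute(self):
        return "execute"

error = None
try:
    UserScript()  # instantiation fails because cleanup() is not implemented
except TypeError as exc:
    error = str(exc)

print(error)
```

The check fires when the class is instantiated, not when the module is imported, so module-level code in the script still runs first.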

5 Comments

Thanks. My worry wasn't about security, more about correct working of the application. The last thing I want is to find myself explaining to an end-user why there can't be any code outside of the requested functions...
@lorenzog: An end-user will not program Python scripts. A programmer might, but he will understand why. See also "Consenting adults". :-)
@lorenzog: What? How on earth would you write software without realizing it!? That's absurd. Install yes. But there, once again, you then need protection against malicious scripts via some sort of sandboxing.
as I said I do not need protection against malicious scripts. If this were the case I would have designed the entire app differently (and anyways, the script interacts with a C++ app via embedded python, so I already have protection). What I need is to guard against mis-use. Worst thing that can happen, however, is a software crash (and somebody wasting my time)
@lorenzog: Sure, and checking for module-level code will not help against that. Sure, it means he can do things on import, but maybe he needs and wants to. :-) Module-level code can, for example, check for the operating system and use different definitions of a function depending on whether it's Windows or OSX. So there are valid use-cases for it that are unproblematic, while most crashes and time-wasting will happen in non-module-level code anyway.
2

Not sure if you'll consider this elegant, but it is somewhat intelligent in the sense that it recognizes when def init are tokens and not just part of a tricky multi-line string:

    '''
    def init does not define init...
    '''

It will not recognize when init is defined in tricky alternate ways such as

init = lambda ... 

or

    codestr='def i'+'nit ...'
    exec(codestr)

The only way to handle all such cases is to run the code (e.g. in a sandbox or by importing) and inspect the result.


    import tokenize
    import token
    import io
    import collections

    userscript = '''\
    def init():
        blah

    """
    def execute():
        more blah
    """

    yadda
    '''

    class Token(object):
        def __init__(self, tok):
            toknum, tokval, (srow, scol), (erow, ecol), line = tok
            self.toknum = toknum
            self.tokname = token.tok_name[toknum]
            self.tokval = tokval
            self.srow = srow
            self.scol = scol
            self.erow = erow
            self.ecol = ecol
            self.line = line

    class Validator(object):
        def __init__(self, codestr):
            self.codestr = codestr
            self.toks = collections.deque(maxlen = 2)
            self.names = set()

        def validate(self):
            tokens = tokenize.generate_tokens(io.StringIO(self.codestr).readline)
            self.toks.append(Token(next(tokens)))
            for tok in tokens:
                self.toks.append(Token(tok))
                if (self.toks[0].tokname == 'NAME'        # First token is a name
                    and self.toks[0].scol == 0            # First token starts at col 0
                    and self.toks[0].tokval == 'def'      # First token is 'def'
                    and self.toks[1].tokname == 'NAME'    # Next token is a name
                    ):
                    self.names.add(self.toks[1].tokval)
            delta = set(['init', 'cleanup', 'execute']) - self.names
            if delta:
                raise ValueError('{n} not defined'.format(n = ' and '.join(delta)))

    v = Validator(userscript)
    v.validate()

yields

ValueError: execute and cleanup not defined 
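The "run the code and inspect the result" route mentioned above can be sketched as follows. It does execute module-level code, so it is a check and not a sandbox; `missing_functions` and `tricky` are hypothetical names used only for this illustration.

```python
REQUIRED = ("init", "execute", "cleanup")

def missing_functions(source):
    """Execute the script in a fresh namespace and report missing callables."""
    namespace = {}
    exec(compile(source, "<userscript>", "exec"), namespace)
    return sorted(n for n in REQUIRED if not callable(namespace.get(n)))

# Both "tricky" definitions that the tokenizer cannot see are found here.
tricky = (
    "init = lambda: None\n"
    "exec('def exe' + 'cute(): pass')\n"
)
print(missing_functions(tricky))  # prints: ['cleanup']
```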

2 Comments

Certainly more elegant than my solution, but I wonder if it offers better performance...
This is exactly what I consider "too much hassle", despite being extremely elegant and concise. At this point I might as well enforce the user script to contain init(), execute() and cleanup(), and warn of undefined behaviour otherwise. However, you answer my question. I'll see if anything better comes up and otherwise mark this as accepted. Thanks.
0

One very simple solution could be to check the first characters of every line of code. The only lines permitted would be:

  • def init():
  • def execute():
  • def cleanup():
  • lines starting with 4 spaces
  • [optionally]: lines starting with #

This is very primitive but it fulfills your requirements...

Update: After a second thought I realized that it isn't so easy after all. Consider for example this piece of code:

    def init():
        v = """abc
    def ghi"""
        print(v)

This means that you'd need a more complex code parsing algorithm... so forget about my solution...

2 Comments

When I said 'simple' I meant actually 'elegant'.. yours is functional but is too much of a hack :)
@lorenzog: You will need a parser then.
0

Here is something I came up with, for those who landed here from Google, where this was the top result. I just needed to pull out a dictionary that was defined in the module's global namespace. The instance type can easily be changed for different use cases, or even parameterized similarly to the top answer.

    import ast

    def open_python_as_ast(path):
        with open(path) as f:
            tree = ast.parse(f.read(), mode="exec")
        return tree

    def parse_ast_for_variable(tree, variable_name):
        for item in tree.body:
            if not isinstance(item, ast.Assign):
                continue
            for targ in item.targets:
                # Tuple, attribute and subscript targets have no .id, so skip them
                if not isinstance(targ, ast.Name) or targ.id != variable_name:
                    continue
                expr = ast.Expression(item.value)
                var = ast.literal_eval(expr)
                return var
        raise NameError(f"variable {variable_name} was not found in the source")

    def parse_python_for_variable(path, variable_name):
        tree = open_python_as_ast(path)
        var = parse_ast_for_variable(tree, variable_name)
        return var
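A hedged usage sketch of the same technique on an in-memory source string rather than a file. `find_literal` and `CONFIG` are hypothetical names; `ast.literal_eval` only accepts literal values, so an assignment whose right-hand side is computed at runtime would raise `ValueError`.

```python
import ast

def find_literal(source, variable_name):
    """Return the literal assigned to a module-level name without running the source."""
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == variable_name:
                return ast.literal_eval(node.value)
    raise NameError(f"variable {variable_name} was not found in the source")

# The print() call is parsed but never executed.
source = 'print("never runs")\nCONFIG = {"retries": 3, "hosts": ["a", "b"]}\n'
print(find_literal(source, "CONFIG"))  # prints: {'retries': 3, 'hosts': ['a', 'b']}
```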

Comments

-1

A solution to points 1 to 3 (not the yadda part) is to hand out a "generic_class.py" with all the methods that you need. So,

    class Generic(object):
        def __init__(self):
            return

        def execute(self):
            return

        # etc

You can then check for the existence of "generic" in what you've imported. If it doesn't exist you can ignore it and if it does then you know exactly what's there. Anything extra will never be called unless it's called from within one of your pre-defined methods.

3 Comments

When you import a module then you execute it... malicious code already executed before you can check "generic" import...
@gecco I said that it solves 1-4 ( I meant 3 ) - edited - not the execution part. It's a better way of enforcing specific methods being in script than anything I can think of but you're right it doesn't stop malicious code. But I wasn't trying to answer that. Off the top of my head your answer would be the simplest way of stopping malicious execution.
This does not really help solve anything. If I have to enforce the existence of an object that subclasses 'Generic', I might as well enforce the presence of init(), execute() and cleanup(). Plus, __init__() in your code refers to the object initialization, which is conceptually different from the initialization of the script: one might set up per-object things in __init__(), which are not relevant to the overall script.
