37
$\begingroup$

C/C++ has an entrypoint int main(int argc, char **argv);, which provides the program with the arguments passed to it and a way to signal back the result:

    #include <stdio.h>

    int main(int argc, char **argv) {
        // Or it could be int main(void); if the arguments are not needed
        for (int i = 0; i < argc; i++)
            printf("%s\n", argv[i]);
        return 1;
    }

More "modern" languages such as Java and C# still provide the arguments, but hide away the result behind an explicit call to exit with a specific exit code:

    public class Main {
        public static void main(String[] args) {
            // C# main is usually void Main(string[] args), but any combination of
            // int|void Main(string[] | void) is allowed.
            for (String arg : args)
                System.out.println(arg);
            System.exit(1);
        }
    }

Rust and Go go even further and hide the argv, which makes me go out of my way to get the arguments and return the result:

    fn cmain(argv: Vec<String>) -> i32 {
        for arg in &argv {
            println!("{}", arg);
        }
        return 1;
    }

    fn main() {
        // Rust main is usually fn main() -> ();
        // It can be fn main() -> Result<(), E: std::fmt::Debug>;
        // but that still doesn't allow user-specified exit codes
        // and more about supporting Rust's func()? syntax
        std::process::exit(cmain(std::env::args().collect()));
    }

Some languages, such as Python and JavaScript, do not have the concept of a main function, so they need explicit calls to get arguments and return results. I find myself writing a main function anyway when using those languages:

    function main(argv: string[]): number {
        for (const arg of argv)
            console.log(arg);
        return 1;
    }

    process.exit(main(process.argv.slice(2)));

Why do modern languages stop exposing the arguments and exit code through the main function? They clearly do support the concept of arguments and exit codes, so it's the reasoning behind this design decision that I'm curious about and cannot find.

$\endgroup$
14
  • 25
    $\begingroup$ In what sense is std::env::args() less convenient than String[] args? You are only going "out of your way" in Rust because you wrote a separate cmain function and called it from the real main function. Your simple program would be a lot simpler if you just wrote for arg in std::env::args() in the main function. Likewise I'm not sure how std::process::exit(...) is significantly less convenient than return. These are things you will write at most once per project; it's more inconvenient to type #include <stdio.h> in C, when modern languages don't hide console output behind ... $\endgroup$ Commented Sep 9, 2024 at 7:12
  • 5
    $\begingroup$ ... an import. And arguably std::env::args() and std::process::exit(...) are more convenient than declaring the args as a parameter of main and returning from it, because you can call them from anywhere in a program. $\endgroup$ Commented Sep 9, 2024 at 7:13
  • 1
    $\begingroup$ std::env::args() is longer than argv, but not longer once you add the parameter declaration String[] argv. As for parsing the args and passing them around, absolutely, but your argument parser needs to access the arguments, and it's more convenient for it to fetch them from std::env::args() itself than for you to have to pass them along yourself. Particularly, you are probably using a library like arg to do the parsing, and that library's API is simpler if it doesn't require you to be a middleman. $\endgroup$ Commented Sep 9, 2024 at 11:32
  • 1
    $\begingroup$ You can easily emulate the C-style main in these languages, but you can't (to my knowledge) easily do the reverse. That seems like an advantage to me. $\endgroup$ Commented Sep 9, 2024 at 14:51
  • 9
    $\begingroup$ Maybe the question is rather "why are modern languages wordier than C", and the answer is "because computers have a million times more RAM and disk space than 50 years ago, so we don't need to sacrifice readability any more to save memory". $\endgroup$ Commented Sep 9, 2024 at 18:47

7 Answers

66
$\begingroup$

The arguments and return code for main() are handled by the C implementation, not by your operating system. On some level, it's an arbitrary decision that C works this way and there is no particular reason why other languages should work this way.

Here is what your C implementation may actually be doing, more or less (not real code):

    // _start is the ACTUAL entry point.
    void _start(void) {
        int argc = get_argc();
        char **argv = get_argv();
        int status_code = main(argc, argv);
        exit(status_code);
    }

How this actually works depends on the operating system, and every operating system is different.

On Linux, the values of argc and argv are stored directly on the stack, so the entry point just has to read those values from the right location and then forward them to main(). You still have to call exit() after main returns.

On Windows, it's a little more complicated. The OS doesn't actually have argc and argv; it has a single string containing the command-line arguments. To get that string, you can call the function GetCommandLineA() or GetCommandLineW() (depending on which encoding you want to use). You can use this string directly, parse it yourself, or parse it using the standard function CommandLineToArgvW() (only the wide-character version exists). So, Windows' version may look something like this:

    void _start(void) {
        wchar_t *command_line = GetCommandLineW();
        int argc;
        wchar_t **argv = CommandLineToArgvW(command_line, &argc);
        int status_code = main(argc, argv);
        ExitProcess(status_code);
    }

However, there is a pair of significant problems here:

  1. On Windows, the command line is not guaranteed to begin with the program name; that is only a convention followed by whoever launches the process. In other words, argv[0] may be missing.
  2. On Windows, you may want the wchar_t version of the command line. (This is another reason why Rust has std::env::args() and std::env::args_os()--the args_os() version will let you work with wchar_t strings that cannot be encoded as UTF-8. This is a pathological case to be sure, but Rust is designed to give you the tools to handle pathological cases correctly.)
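To make the "single string" point concrete, here is a rough sketch of the kind of work a runtime has to do before it can hand an argv to main. The function name `split_command_line` is made up for this illustration, and the rules are deliberately simplified (whitespace splitting plus double quotes, no backslash escapes), so it is not a reimplementation of what Windows actually does.

```rust
/// Split a command line on whitespace, treating "double-quoted" runs as one
/// argument. Simplified on purpose: no backslash escapes, unlike the real
/// Windows parsing rules.
fn split_command_line(line: &str) -> Vec<String> {
    let mut args = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut has_token = false; // lets "" count as an (empty) argument

    for ch in line.chars() {
        match ch {
            '"' => {
                // Quotes toggle quoting and are not copied into the argument.
                in_quotes = !in_quotes;
                has_token = true;
            }
            c if c.is_whitespace() && !in_quotes => {
                // Unquoted whitespace ends the current argument, if any.
                if has_token {
                    args.push(std::mem::take(&mut current));
                    has_token = false;
                }
            }
            c => {
                current.push(c);
                has_token = true;
            }
        }
    }
    if has_token {
        args.push(current);
    }
    args
}

fn main() {
    // A hypothetical command line, delivered as one string.
    for arg in split_command_line(r#"prog.exe "hello world" --flag"#) {
        println!("{}", arg);
    }
}
```

Whether the first token is a program name is entirely up to whoever built the string, which is the first problem listed above.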

Conclusion

You can see that with all these differences between different operating systems, it doesn't really make sense to copy the way C does things. C doesn't do it very well.

  • In order for your process to exit, on common operating systems, you have to call a system call like _exit() or ExitProcess(). If main returns a status code, it just means that there is some code in your C runtime that takes the result of main() and calls that function.

  • The format for program arguments is different on different operating systems, and there is not a simple way to unify it all.

$\endgroup$
6
  • 3
    $\begingroup$ Great answer. I would just add that the C language was originally created for implementing utilities in the early Unix system. I see that as the reason why the Unix process calling interface is so integrated into C. $\endgroup$ Commented Sep 10, 2024 at 7:23
  • 1
    $\begingroup$ I have to downvote as the answer is factually incorrect. On linux, you create a process using fork/exec. The exec variants actually do take an argc/argv; the process has an exit status you can query when calling "wait" variants. So the POSIX (i.e. OS standard) process creation API perfectly aligns with the C main API (by design). Why this fact emerges and whether it justifies this API for C programs is debatable (more probably bad design separation as stated in other answers). On windows the command line is indeed a single line, so the answer is more realistic/correct. $\endgroup$ Commented Sep 10, 2024 at 11:47
  • 4
    $\begingroup$ @YannTM: There is no argc parameter to any of the exec() family functions. $\endgroup$ Commented Sep 10, 2024 at 16:11
  • 7
    $\begingroup$ @YannTM: The POSIX standard describes the C interface. That C interface is provided by the C implementation on your operating system—it is not necessary to use the C interface when you are not using C. $\endgroup$ Commented Sep 10, 2024 at 16:12
  • $\begingroup$ There is no argc, but the argv parameter in C is guaranteed to have a sentinel 0x0 last argument, so argc is actually not necessary, it's a feature. The args to exec are passed as is to main, so the point stands. The C interface is what the system implements, whether you acknowledge it or not, any process created on posix comes from this system call, you must indeed go through it to create a process. $\endgroup$ Commented Sep 12, 2024 at 19:33
28
$\begingroup$

Different languages are useful for writing very different kinds of programs, in different domains entirely.

A command-line interface might simply not be the main way to interact with programs written in a given language.

As you point out, most or all "general purpose" languages do have a way to access argv/argc.

But they don't have to be at the center of attention all the time, especially if most programs are not expected to use them.

Also, many languages (like C#, which you mentioned) actually let you choose whether to use int main(args) or void main().


(edit: adding examples from the comments)

When Java was originally released, it was popular for writing applets that run in the browser. When the program is an applet, the platform is the browser, and it doesn't expect a return value like a regular OS.

Even languages like Go, which seem to be geared for low-level programming and/or CLI/headless applications, can also be compiled to JS or wasm. Putting args and exit into an os package is a way to avoid tightly coupling one specific entry point design to the language. When designing a language it's usually a good thing to delegate concerns to libraries. When compiling for an environment without argv/exit, it's nicer to just have a missing os package than to have superfluous handling of fake empty arguments (from the perspective of both the language implementer and of the language user).

$\endgroup$
10
  • 2
    $\begingroup$ Surely it doesn't cost much to keep access to them more convenient? The OS speaks C's int main(int argc, char **argv);, the language just doesn't expose a similarly convenient form. Also, C# is the only language that I know of that allows users to choose, and I only learned about it while researching this question - I've only ever known void Main(string[] args);. Its cousin Java enforces void main(String[] args); for main. $\endgroup$ Commented Sep 9, 2024 at 6:42
  • 3
    $\begingroup$ When Java was originally released, it was popular for writing applets that run in the browser. When the program is an applet, the platform is the browser, and it doesn't expect a return value like a regular OS. $\endgroup$ Commented Sep 9, 2024 at 7:04
  • 12
    $\begingroup$ Rust and Go can also be compiled to JS. Putting args and exit into an os package is a way to avoid tightly coupling one specific entry point design to the language. When designing a language it's usually a good thing to delegate concerns to libraries. When compiling for an environment without argv/exit, it's nicer to just have a missing os package than to have superfluous handling of fake empty arguments (from the perspective of both the language implementer and of the language user). $\endgroup$ Commented Sep 9, 2024 at 10:11
  • 8
    $\begingroup$ @404NameNotFound: The reason for not passing dummies around is not performance, it's semantics. The human developer has enough concerns that they shouldn't need to, on top, wade through dummies. $\endgroup$ Commented Sep 9, 2024 at 13:26
  • 13
    $\begingroup$ @404NameNotFound: The OS does not speak C's int main(int argc, char** argv). That's a UNIX thing, which co-evolved with C. Windows uses GetCommandLineW, which doesn't include argc, is UTF-16 encoded, and retains whitespace. It's Visual C++ which wraps that up as int main(int argc, char** argv) $\endgroup$ Commented Sep 9, 2024 at 15:45
26
$\begingroup$

Disclaimer: I will not attempt to go spelunking into the archives of each language's design to figure out why each of them decided NOT to follow in C's footsteps. Instead, I'll focus on the downsides.

Let's answer a slightly different question: what's wrong with C's design?

Locality

From a user point of view, passing the command line arguments to main is mostly useful if those arguments are handled "close" to main.

If, instead, parsing the arguments involves a complex routine, and requires threading said arguments through multiple layers, then a user may prefer Python's or Rust's approach of being able to access said arguments from anywhere.
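As a small illustration of that second style, the sketch below has a helper several calls away from main read the process arguments itself. The names `verbose_requested`, `has_flag` and `run`, and the `--verbose` flag, are invented for this example.

```rust
use std::env;

/// Hypothetical helper buried deep in the program: it reads the process
/// arguments itself instead of having them threaded down from main.
fn verbose_requested() -> bool {
    has_flag(env::args().skip(1), "--verbose")
}

/// Pure helper, so the lookup can be exercised without a real command line.
fn has_flag<I: IntoIterator<Item = String>>(args: I, flag: &str) -> bool {
    args.into_iter().any(|a| a == flag)
}

fn run() {
    if verbose_requested() {
        println!("verbose mode on");
    }
    println!("doing work");
}

fn main() {
    run(); // nothing about argv is passed down
}
```

Keeping the actual lookup in a function that takes an iterator is one way to keep the code testable with made-up arguments, since the real process arguments cannot easily be swapped out in a test.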

Character Encoding

The problem with char is that... it's not a character. It's a byte. It can adequately represent ASCII characters, but anything further is a mess.

Unix uses... something close to UTF-8, so it works relatively well.

Windows uses something close to UTF-16 instead, and thus would need wchar_t instead of char to pass the arguments through unchanged.

And let's not even talk about embedded NUL bytes...

If you look closer at Rust you'll notice there are two APIs:

  • std::env::args() returns UTF-8 strings, which is convenient, but not guaranteed to be able to map all arguments perfectly.
  • std::env::args_os() returns OsString, which perfectly captures the OS-level argument, even on Windows, but may not be convertible to UTF-8.

Thankfully, both deal with 0.
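A minimal sketch of how the two APIs differ in use: `describe` is a name made up for this example, while `args_os`, `OsString::into_string` and `to_string_lossy` are standard library calls.

```rust
use std::env;
use std::ffi::OsString;

/// Describe one raw argument: either valid Unicode, or a lossy rendering.
fn describe(arg: OsString) -> String {
    match arg.into_string() {
        Ok(s) => format!("unicode: {}", s),
        // into_string hands the original OsString back on failure.
        Err(raw) => format!("not unicode: {}", raw.to_string_lossy()),
    }
}

fn main() {
    // args_os() yields every argument as the OS delivered it; env::args()
    // would instead panic on an argument that is not valid Unicode.
    for arg in env::args_os().skip(1) {
        println!("{}", describe(arg));
    }
}
```

A signature like `main(argv: Vec<String>)` would have to pick one of these behaviours for every program; leaving it to a library call lets each program choose.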

Performance

The signature of main in C forces every argument to be (1) converted to a char array and (2) stored in memory.

For example, if Rust used fn main(argv: &'static [&'static str]), then it would have to create an array of &'static str:

  1. This would occupy argv.len() × 16 bytes (8 bytes for the length and 8 bytes for the pointer, on a 64-bit target).
  2. This would require validating that all arguments are, indeed, correct UTF-8 (which Unix doesn't guarantee).

If the user only ever checks the first argument, that's a lot of needless work, and needless memory used.

And of course, it gets worse on Windows, where the arguments are not stored in anything close to UTF-8 in the first place.

Note: It's not unusual for C++ compilers, or linkers, to be invoked with hundreds to thousands of arguments.

Conclusion

C's design emerged over 40 years ago, in a very different environment.

Since then the world has evolved, and introduced challenges that the design copes... badly with. Band-aid upon band-aid to kinda make it work.

It's rather expected that, 40 years later, with different requirements and the benefit of hindsight, new languages pick a different design.

$\endgroup$
12
  • 4
    $\begingroup$ "Unix uses... something close to UTF-8" -- I'm not sure what you mean with "something close", but I don't think too many unix-likes enforce UTF-8 or anything here, but just pass the bytes around. Except that there can't be embedded NUL-bytes, since API uses that as the string terminator. $\endgroup$ Commented Sep 10, 2024 at 7:38
  • 2
    $\begingroup$ @ilkkachu: They don't enforce UTF-8 AFAIK, indeed, hence my specific choice of words (and phrasing). $\endgroup$ Commented Sep 10, 2024 at 11:30
  • $\begingroup$ Re: memory, can you elaborate on how not using C’s design avoids keeping arguments in memory? If a process could request them at any time, that memory has to be allocated somewhere. (But maybe you just meant the extra memory of the pointers and arrays, not strings? If so, that point wasn’t clear to me.) $\endgroup$ Commented Sep 10, 2024 at 17:13
  • 1
    $\begingroup$ Global variables aren't really a problem. The problem is shared mutable state. argv is not mutated (unless you are doing something incredibly sketchy) so it is not problematic. $\endgroup$ Commented Sep 11, 2024 at 2:02
  • 1
    $\begingroup$ @OscarSmith: You are correct that global mutable state is particularly problematic, and global constants are not. argv is in the middle though: while it cannot be modified, its value is unknown until runtime -- which means you ideally ought to want to test with various values, but as you mentioned cannot really go and modify argv to do so. So it's not as clear cut, really. $\endgroup$ Commented Sep 11, 2024 at 7:04
11
$\begingroup$

Even in the olden days, main and args varied.

Interpreted languages and functional languages often don't follow the main(char** argv) style. Bash uses $1 $2 $3 and executes everything in the outermost code block, like Python does. Perl uses @ARGV at global scope. Lisp/Scheme don't have a main, and don't have unified ways of getting argv. (Because why aren't you running them on a Lisp machine?) Fortran doesn't have a main fn, and argv access is added via functions like getarg and get_command_argument. Haskell's entry point is just an IO action that takes no parameters, but the library provides System.Environment.getArgs.

$\endgroup$
10
$\begingroup$

Most Programs are not Launched From Command Prompts

Those that are, are mainly running on Windows, Linux or UNIX, where other languages already fill the niche of command-line utilities. Since this interface exists, you will occasionally see GUIs configure their shortcuts to pass options on the command line, if it might be invoked different ways depending on which one the user clicks, but even that is rare these days.

Modern languages normally provide some way to get the command-line arguments, but it’s more like how C programs look up environment variables. It’s no longer assumed that every program is launched by an operator typing at a console or from a script saved as a text file. Command-line arguments are no longer so important that a general-purpose language would have special syntax sugar for them.

The argc and argv Data Structures are Terrible

I often find myself telling C programmers that it’s unfortunate everyone learns argv first, because a pointer to pointers to null-terminated strings is a terrible data structure and almost never the best kind of “two-dimensional array” to use.

Because they’re so old, they also let coders shoot themselves in the foot with a lot less warning than even modern C would give them, especially the type system not tracking the size of or preventing modification of any of the arguments or strings.

Modern languages all have some kind of native String type and some kind of generic Vector container, so a modern API would return the list either as a vector of strings or a vector of string views.

It’s an Awkward Fit for Type Systems

Languages sometimes allow function overloading, but they don’t normally have a feature that lets a function be called with different arguments depending on how it’s implemented in another file. That’s not useful in real-world programs. Even in C, written in part to be low-level enough to implement its own standard library (see “Why Pascal is not my Favorite Language”), there’s never been any demand for something like, hypothetically:

    #if __extern_function_args(foo, int, char**)
        foo(argc, argv);
    #elif __extern_function_args(foo, void)
        foo();
    #endif

Even when ANSI C added variadic functions to make it possible to write printf() in C, they didn’t see any need to make it flexible enough to call either int foo(void) or int foo(int argc, char **argv), even in the same source file. C and C++ make main() the one special exception to how every other function works.

There’s a lot of weird stuff in C because something happened, by coincidence, to work on the DEC PDP-11 back in its formative years and changing it now would break backward compatibility. There’s not much reason for other languages to do things the same way, other than to imitate C.

Some Newer Languages do Have Error Returns

For example, Rust allows main() to return any type that implements the std::process::Termination trait. The Standard Library allows you to return an ExitCode of SUCCESS or FAILURE, like EXIT_SUCCESS and EXIT_FAILURE in C. But this feature really exists to allow the ? operator to work in short demo programs that do I/O (by short-circuiting on error and printing a debug message).
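As a hedged sketch of what that looks like in practice (assuming a toolchain where `std::process::ExitCode` is stable, i.e. Rust 1.61 or later), main can return a status directly. The helper `status_for` and the status value 2 are arbitrary choices for this example.

```rust
use std::process::ExitCode;

/// Hypothetical program logic: succeed only if at least one argument is given.
fn status_for(arg_count: usize) -> u8 {
    if arg_count > 0 { 0 } else { 2 }
}

fn main() -> ExitCode {
    let arg_count = std::env::args().skip(1).count();
    // ExitCode implements From<u8>, so a specific status can be returned
    // without calling std::process::exit.
    ExitCode::from(status_for(arg_count))
}
```

Unlike `std::process::exit`, returning from main this way lets destructors of main's local values run before the process ends.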

$\endgroup$
8
  • 1
    $\begingroup$ re: point 2, I don't think the issue is e.g. Java's argv being a String[] rather than a null-terminated char**. OP uses the example of writing a custom cmain(argv: Vec<String>) -> i32 in Rust $\endgroup$ Commented Sep 9, 2024 at 20:55
  • 2
    $\begingroup$ Another bit of weirdness is that the Standard didn't want to require that programs targeting platforms that don't recognize any concept of return codes must nonetheless end with return 0;, but didn't want such programs to malfunction on platforms that require that programs return zero to indicate success, so the Standard added a special rule for main's return value. $\endgroup$ Commented Sep 9, 2024 at 22:09
  • 2
    $\begingroup$ @supercat This is another one of those things that just happened to work on the DEC PDP-11 running UNIX. It’s also another way that main() is a precious snowflake. No other function declared as returning int is allowed to omit a return statement. $\endgroup$ Commented Sep 10, 2024 at 1:33
  • 2
    $\begingroup$ "not preventing modification of any of the arguments " - why is that an issue? It's perfectly legal in C to modify your arguments. It is a true char**. There are reasonable designs which do modify their arguments as they're being parsed, to keep track of parsing state. $\endgroup$ Commented Sep 10, 2024 at 7:26
  • 5
    $\begingroup$ For some reason, the essay Why Pascal is not my favourite language is not as easy to find as one would expect, so here's a link. $\endgroup$ Commented Sep 10, 2024 at 9:56
8
$\begingroup$

"Explicit is better than implicit"

Philosophically for Python, giving an exit code back to the OS is doing something fundamentally different from passing a result from a called function back to the caller - especially since that code can only be a machine-level integer (of whatever platform-defined size), not a Python object. Indeed, sys.exit already needs to translate whatever's passed to it. Expecting a specific function to have the special behaviour of communicating back to the operating system, just because "it was called first", is a bridge too far. (Especially given that code actually starts running outside of any function - you could maybe argue that the top-level code "returns" a module object once it's done, but how do you translate that into success or failure, let alone allowing for multiple numbered failure status codes?)

"Special cases aren't special enough to break the rules"

Similarly on the input side: the command line has to be parsed, or at least tokenized, and then used somewhere. But there's no specific function earmarked to receive them, because the general rule in Python is that functions are only called when you call them (another kind of explicitness). The only "automatic" running of code is from the top of the file downward - which is necessary e.g. so that imports can happen at runtime.

We might instead imagine the Python startup process assigning command-line arguments to a builtin name, which we could then treat as if it were a "parameter name" for the top-level code in the __main__ module. But presumably we'd reject this idea on the same aesthetic grounds as already described. (After all, sys is a builtin module implemented directly by the interpreter rather than by Python code; one might equally well argue for dumping its entire contents into the builtin namespace preemptively.)

$\endgroup$
4
$\begingroup$

Unix was designed to minimize the amount of information a program launching another would need about the program being launched, and vice versa. The mechanism used when launching programs made it easy to pass a plurality of pointers to strings to a called program, and having main accept those as arguments avoided the need to have anything else in the "permanently loaded" operating system know or care about them. Programs that would need to access argc and argv outside of main() could have main store them to an external-scope object if desired.

Since that time, the use of general-purpose libraries became more widespread; some such libraries would need to access command-line arguments. If one such library would expect argv to have been stored into char ***my_arg_values; and another would expect it to have been stored into char **my_arguments;, then main() would have to ensure that argv was written to both such locations. Having to patch main() in order to use a library was rather a nuisance, so designers of later languages and systems established conventions for global-scope symbols for places where information about arguments would be stored at program invocation. Different languages and systems would use different symbol names, but a language designer could offer a consistent name for a function that could use whatever system-specific symbols or other means would be needed to retrieve the information in platform-independent fashion.

As for return value, it would have been apparent fairly early on that situations may arise where a program would discover in the middle of a nested function call that it wasn't going to be able to do anything useful. While it would have been possible to have main() use a couple of global symbols to perform a setjmp() and allow called code to store the return status before doing a longjmp() back to main(), which could then return the appropriate status code, this would require the called function to know what symbols the author of main() had chosen for that purpose. Having a call to main() wrapped in a platform/implementation-specific way which can then be exploited by a platform-specific implementation of a function with a standard exit() interface avoids the need for such coordination between the author of main() and code needing to exit.
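The same idea carries over to newer languages. In the sketch below, a function far from main ends the process with a status of its own choosing, with no coordination with main. The names `parse_level` and `deep_in_the_call_stack`, and the status 64, are assumptions made for this illustration.

```rust
use std::process;

/// Hypothetical deep helper: on failure it reports the status it wants the
/// process to end with (64 is an arbitrary value for this sketch).
fn parse_level(text: &str) -> Result<u32, i32> {
    text.parse::<u32>().map_err(|_| 64)
}

fn deep_in_the_call_stack(text: &str) -> u32 {
    match parse_level(text) {
        Ok(level) => level,
        Err(status) => {
            eprintln!("bad level: {}", text);
            // Ends the process right here; main never sees a return value.
            process::exit(status);
        }
    }
}

fn main() {
    let level = deep_in_the_call_stack("3");
    println!("level = {}", level);
}
```

One trade-off to be aware of: `std::process::exit` does not run destructors on the current stack, so propagating a `Result` back up to main can still be the better choice when cleanup matters.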

Once the usable-from-anywhere means of accessing command-line arguments were standardized, the ability of main() to accept arguments directly and have its return value passed back to the system became redundant. While it made sense for C implementations to continue supporting them, there was no reason to carry that convention into other languages.

$\endgroup$
4
  • $\begingroup$ "anything else in the "permanently loaded' operating system know or care about them" - I have a feeling there's a word or two missing in that sentence, but I can't figure what it means either way. Are you referring to some sort of static memory allocation? In the program or in the operating system? $\endgroup$ Commented Sep 10, 2024 at 18:46
    $\begingroup$ Maybe you can offer better wording--the basic idea is that once the parent process had built up the argument strings and the pointers to them, and called main(), the parent process could be jettisoned from memory leaving behind nothing except those strings and pointers. Having a pointer to the arguments live in a static-duration object would require that the calling program and called program both agree on where that object should be, and having a called program use a function to retrieve arguments would require spending precious RAM to hold that function. $\endgroup$ Commented Sep 10, 2024 at 21:32
  • $\begingroup$ So are those strings and pointers living in the memory of the parent process, in the memory of the operating system, or in the memory of the child process? (Or is it all just shared memory?) Or is your point that it doesn't matter? I don't get what this has to do with the object lifetime though - surely no matter where they are placed, they must be held in memory until the child process ends (if it hasn't shared them further). $\endgroup$ Commented Sep 10, 2024 at 23:03
  • $\begingroup$ @Bergi: I think the strings would have originally been placed at the top of what would become the stack of the new process. In the first Unix system, each process would have almost the entire memory space of the computer available, save for a small amount reserved for the operating system. Switching tasks would always mean storing the memory state of the current process to disk, and loading the memory state of the process to switch to from disk; there was no arbitration of memory usage between applications. $\endgroup$ Commented Sep 11, 2024 at 14:41
