# Foreword

I agree with you that this whole thing sounds like a daunting - but fascinating - task, and that there's a lot of ground to cover. So I'm humbly going to suggest what I think could be a rather comprehensive guide for your team, with pointers to appropriate tools (and alternatives) and appropriate reading or educational material to share.

<sub>_I'm planning a rather large answer to this, and this is only a work-in-progress at the moment, so check back again later for more, as I won't have time to finish it now. Apologies for that._</sub>

-----

# Executive Summary for the Impatient

* Define a **rigid project structure**, with:
    * **project templates**,
    * **coding conventions**,
    * familiar **build systems**,
    * and sets of **usage guidelines** for your infrastructure and tools.
* Install a good **SCM** and make sure they know how to use it.
* Point them to good **IDEs** for their technology, and make sure they know how to use them.
* Implement **code quality checkers** and **automatic reporting** in the build system.
* Couple the build system to **continuous integration** and **continuous inspection** systems.
* With the help of the above, identify **code quality "hotspots"** and **refactor**.

_Now for the long version... Caution, brace yourselves!_

----

# Rigidity is (Often) Good

_This is a rather controversial opinion, as rigidity is often seen as a force working against you and slowing you down. It's true for some phases of some projects. But once you see rigidity as a structure, a framework that takes away the guesswork, it greatly reduces the amount of wasted time and effort. Make rigidity work for you, not against you._

## Rigidity of the Project Structure

If each project comes with its own structure, you are lost and need to start from scratch every time you look at it, and the same applies to each newcomer. You don't want this in a professional software engineering shop, and you don't want this in a research lab either.

## Rigidity of the Build Systems

As mentioned above, if each project **looks** different, there's a good chance each also **builds differently**. A project's build shouldn't require too much research or too much guesswork. In general, you want to be able to do the canonical thing and not need to worry about specifics: `configure; make install`, `ant`, `mvn install`, etc... A quick `README` at the root can point to the things that differ, but that's all there should be (in an ideal world).

Plus, this also greatly facilitates other parts of your build infrastructure, namely:

* [continuous integration][1],
* [continuous inspection][2].

It also helps to ensure that all projects are built to the same level of quality, by re-using the same build system for all of them and making it evolve over time. Not only do you keep it (and all your projects) up to date, you also make it stricter over time, and more efficient at reporting potential mistakes and enhancements. Do not reinvent the wheel for each project: reuse what you have already done.

**Recommended Reading:**

* _[Continuous Integration: Improving Software Quality and Reducing Risk][3]_ (Duvall, Matyas, Glover, 2007)
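To make the last two sections concrete, here is a minimal sketch of what a shared project template could look like. I'm assuming the standard Maven layout here (Maven comes up again later in this answer), but the specific convention matters far less than every project sharing it:

```
my-project/
|-- README         <- the one place to document anything non-canonical
|-- pom.xml        <- standard build: `mvn install` just works
`-- src/
    |-- main/
    |   |-- java/         <- production code
    |   `-- resources/    <- production configuration and assets
    `-- test/
        |-- java/         <- unit tests (see "Rigidity in Testing" below)
        `-- resources/    <- test fixtures
```

Whatever the layout, the point is that a newcomer should be able to check out any project in the lab and immediately know where everything lives and how to build it.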
## Rigidity in the Choice of Programming Languages

You probably can't expect, especially in a research environment, to have all teams (let alone all individual developers) use the same language and technology stack. However, you can identify a set of "officially supported" languages and frameworks, and encourage their use. Other languages shouldn't be permitted beyond prototyping without a good rationale. It is essential to keep your build system simple, and to keep the maintenance and breadth of required skills down to a bare minimum: a core set of technologies and tools.

## Rigidity of the Coding Conventions and Guidelines

Coding conventions and guidelines are what allow you to develop both an identity as a team and a shared _lingo_. You don't want to err into _terra incognita_ every time you open a source file. There's no use trying to enforce nonsensical rules that make things harder, or forbidding things to the extent that commits would be refused based on a single violation. However, it takes away a lot of the whining and of the thinking if you identify a clear, concise set of ground rules that **nobody** should break under any circumstances, and a set of recommended rules that people are advised to follow.

I am fairly aggressive when it comes to coding conventions, some even say _nazi_ <sup>(without wanting to offend anyone with the evocation)</sup>, because I do believe in having a _lingua franca_ and a recognizable style for my team. When crap code gets checked in, it stands out like a cold sore on the face of a Hollywood star, which helps you to identify that a quick review and action are required. In fact, I've sometimes gone as far as to advocate the use of pre-commit hooks to reject commits that do not satisfy some common rules (a sketch of what such a hook could check follows below). As mentioned before, it shouldn't be overly crazy and get too much in the way, especially as you try to introduce these measures. But it may well be worth it if you spend so much time reviewing and dealing with crap code that you can't work on real issues.

Some languages enforce some rules by design. Java was meant to reduce the amount of dull crap you can write with it (though no doubt it can be done, as evidenced here and on SO), for instance. Python's block structure by indentation is another idea in this sense. So is the Go programming language with its `gofmt` tool, which completely takes any styling work - **and ego!!** - out of the coding effort: if you run it before every commit, things are sure to always look fine to everybody.

Be sure to make it so that **critical code gore** cannot slip through. **Code conventions**, **continuous integration** and **continuous inspection**, and **pair programming** and **code reviews** are your best weapons against this demon.

Plus, as you'll see below, **code is documentation**, and that's another area where your conventions should encourage proper readability and clarity.
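Here is that sketch: a deliberately tiny convention checker of the kind a pre-commit hook could delegate to. It is only an illustration - the two rules and the file-arguments interface are made up for the example, and in practice you would reach for an existing checker (Checkstyle, for instance) rather than rolling your own:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/**
 * A minimal convention checker: scans the files it is given (e.g. the files
 * staged for a commit) against two of the team's hypothetical ground rules.
 */
public class ConventionCheck {

    private static final int MAX_LINE_LENGTH = 120; // one of the "ground rules"

    public static void main(String[] args) throws IOException {
        int violations = 0;
        for (String file : args) {
            List<String> lines = Files.readAllLines(Paths.get(file));
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                if (line.contains("\t")) {
                    System.err.printf("%s:%d: tab character (use spaces)%n", file, i + 1);
                    violations++;
                }
                if (line.length() > MAX_LINE_LENGTH) {
                    System.err.printf("%s:%d: line longer than %d chars%n", file, i + 1, MAX_LINE_LENGTH);
                    violations++;
                }
            }
        }
        // A non-zero exit status is all a calling hook needs to reject the commit.
        System.exit(violations == 0 ? 0 : 1);
    }
}
```

A real pre-commit hook would simply run this (or, better, an off-the-shelf checker) against the changed files and veto the commit when the exit status is non-zero.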
## Rigidity of the Documentation

Documentation goes hand in hand with code. Code itself can be documentation. But there must be clear-cut instructions on how to build things, how to use things, and how to maintain things.

Using a single point of control for documentation (like a WikiWiki or DMS) is a good thing. Create separate spaces for projects, and separate spaces for more random banter and experimentation. Make sure that each of these spaces reuses a set of common rules, and that people take care to follow them when editing. In fact, most of the instructions that apply to code and tooling here apply to documentation as well.

### Rigidity in Code Comments

Code comments, as mentioned above, are also documentation. Developers like to express their feelings about their code (mostly pride and frustration, if you ask me). So it's not unusual for them to express these in no uncertain terms in comments (or even code), when a more formal piece of text could have conveyed the same meaning with fewer expletives or less drama. It's OK to let a few slip through for fun and historical reasons: it's also part of **developing a team culture**. But it's very important that everybody knows what is acceptable and what isn't, and that comment noise is just that: **noise**.

### Rigidity in Commit Logs

Commit logs are not that annoying part of an SCM's usage lifecycle that you just need to skip to get home on time, to get on with the next task, or to catch up with the buddies who left for lunch. They matter, and, like (most) good wine, the more time passes, the more value they have. So make sure they are done right. I'm always flabbergasted when I see co-workers writing one-liners for giant commits, or for non-obvious things.

All commits are done for a reason, and that reason may not be clearly expressed in the one line of code you added and the one line of commit log you entered. There's more to it than that. **Each line of code has a story, and a history**. The diffs can tell its history, but you have to write its story.

> Why did you need to update this line? Because the interface changed.
>
> Why did the interface change? Because the library that provides it
> was updated.
>
> Why was this library updated? Because it's a dependency of another
> library that we needed to implement feature X.
>
> And what's feature X? All about it is in `TASK_KEY_HERE`.

Git actually gets this right, in that it is more geared towards providing good logs than any other SCM. It's not my SCM of choice, and not necessarily the best one for your lab either, but it gets this right. It lets you provide a short log and a long log. Leave the general update to the `shortlog`, with the reference task IDs to link to your issue tracker (yes, you need one), and expand in the long log. Write the changeset's **story**.

<sub>For crying out loud, if you can do it on a blog, you can do it in a log. It's the same origin for (We)Blogs, after all: just keeping track of things.</sub>

Really ask yourself the question:

> If I were searching for something about this change later, would
> this log answer my questions?

### Documentation and Code, and Projects as a Whole, Are ALIVE

You need to keep them in sync, otherwise they no longer form that symbiotic entity. That's why it works wonders when you have:

* clear commit logs in your SCM, with links to task IDs in your issue tracker,
* where this tracker's tickets themselves link to the changesets in your SCM, and possibly to the builds in your continuous integration system,
* and a documentation system that links to all of these.

Code and documentation need to be cohesive.

## Rigidity in Testing

Any new code shall come with (at least) unit tests. Any refactored legacy code shall come with unit tests. Period. Of course, these tests need to actually test something valuable, and not be just a waste of time and energy. They need to be well written and commented, just like any other code you check in. They are documentation as well, and they help to outline the contract of your code. Especially if you use [Test Driven Development][4]. But even if you don't, you need them for your peace of mind. They are your safety net for the future (for maintenance, for future enhancements) and your antibiotic against normal code rot. And of course, you should go further and have [integration tests][5], and [regression tests][6] for each reproducible bug you fix.
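To show what "tests as documentation" can look like, here is a minimal JUnit 4 sketch. The class under test is hypothetical and deliberately trivial; the point is that each test name spells out one clause of the contract:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

// The class under test: hypothetical, kept trivial for the example.
class SampleNormalizer {
    String normalize(String label) {
        if (label == null || label.trim().isEmpty()) {
            throw new IllegalArgumentException("empty sample label");
        }
        return label.trim().toLowerCase();
    }
}

// Each test name states one guarantee that future maintainers can rely on.
public class SampleNormalizerTest {

    private final SampleNormalizer normalizer = new SampleNormalizer();

    @Test
    public void normalizeTrimsSurroundingWhitespace() {
        assertEquals("ph 7.4", normalizer.normalize("  ph 7.4  "));
    }

    @Test
    public void normalizeLowercasesLabels() {
        assertEquals("ph 7.4", normalizer.normalize("PH 7.4"));
    }

    @Test(expected = IllegalArgumentException.class)
    public void normalizeRejectsBlankInput() {
        normalizer.normalize("   ");
    }
}
```

When these run green they read like a specification; when a refactoring turns them red, that's the safety net doing its job.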
## Rigidity in the Use of the Tools

Sure, it's OK for the occasional developer/scientist to want to try some new static checker on the source, generate some graph or model using another tool, or implement a new module using a DSL. But it's best if there's a canonical set of tools that **all** team members are expected to know about and to use.

I regard it as generally OK to **recommend** a default working environment with these tools, but to let each developer use their IDE or editor of choice, **as long as they are productive** AND **do not require regular assistance** to adjust to your general infrastructure AND **do not modify the common areas (code, build system, documentation...) in ways that affect other developers**. If that's not the case, then it's fair to enforce that they fall back to your defaults.

**Note:** Of course, some flexibility is good. Letting someone occasionally use a shortcut, a quick-n-dirty approach, or a favorite pet tool because it **gets the job done** is fine... But **never** let it become a habit, and don't let this snippet of code or prototype become the actual codebase to support.

----

# Team Spirit Matters

### Develop a Sense of Pride in Your Codebase

* Develop a sense of pride in code.
* Use wallboards:
    * a leader board for a continuous integration game,
    * wallboards for issue management and defect counting.
* Use an [issue tracker][7] / [bug tracker][8].

### Avoid Blame Games

* DO use Continuous Integration / Continuous Inspection games: it fosters good-mannered and [productive competition][9].
* DO keep track of defects: it's just good housekeeping.
* DO **identify root causes**: it's just future-proofing your processes.
* BUT DO NOT [assign blame][10]: it's counter-productive.

### It's About the Code, Not About the Developers

The whole point is to make developers conscious of the quality of their code, but to have them see it as a detached entity and not as an extension of themselves (and react badly when a part of this extension is criticized). Encourage [ego-less programming][11] for a healthy workplace, but do rely on ego for motivation.

----

# From Scientist to Programmer

You can't expect people who do not value and take pride in code to produce good code. They need to discover how valuable (and fun) it can be, for this property to emerge. Sheer professionalism and the desire to do good are not enough: good code needs passion. So you need to turn your scientists into **programmers** (in the large sense).

-----

# Code Maintenance is Part of Research Work

Nobody wants to read a crappy research paper. Papers are proof-read, refined, rewritten, and re-submitted for approval countless times until they reach the final state that's deemed good enough for publication. The same applies to a thesis. And **the same applies to a codebase!** You want to make it clear that constant refactoring and refreshing of a codebase is what prevents code rot and technical debt, and what facilitates future re-use and adaptation of the work for other projects.

----

# Why All This??!

In the end, why do we need all of the above? For the Holy Grail: **code quality**. Or is it **quality code**...? All of the above aims at driving your team towards this goal. Some aspects of it do so by getting them to genuinely want it themselves (which is much better), and others by gently taking them by the hand (but that's how you educate people and develop habits).
But how do you know if you have found the Holy Grail, and not some cheap knock-off (which might make you turn to dust quickly, which is unpleasant)?

## Quality is Measurable

Not always quantitatively, but it **is measurable**. As mentioned above, you need to develop a sense of pride in your team(s), and showing progress and good results is key. Measure code quality at a point in time, and show the progress between intervals. Show how it matters. Do retrospectives to reflect on what has been done, and how it made things better or worse.

There are great tools out there for **continuous inspection**. [Sonar][12] is one of them - quite popular in the Java world, but adaptable to other technologies - and there are many others. Keep your code under the microscope and look for these pesky annoying bugs and microbes.

----

# But What if My Code is Already Crap??

Of course, all of the above is fun and cute like a trip to Never Land, but it's not that easy to do when you already have (a big pile of steamy and smelly) crap code. Here's the secret: **you need to start somewhere**.

> **Personal anecdote:** In our current project, we are working with a
> codebase that originally was more than 650,000 lines of Java code,
> more than 200,000 lines of JSPs, more than 40,000 lines of
> JavaScript, and more than 400 MBs of binary dependencies on
> external projects and libraries.
>
> Today, after about 18 months, we have 500,000 lines of **(MOSTLY
> CLEAN)** Java code, around 150,000 lines of JSPs, and still about
> 38,000 lines of JavaScript, and our dependencies are down to barely
> more than 100 MBs (and these dependencies are not in our SCM
> anymore!).
>
> **How did we do it?** _We just did all of the above. Or we try to._
>
> It's a huge team effort, but we slowly **inject** new regulations
> and new tools that help us to monitor the heart-rate of our product,
> while we hastily **slash** away the fat of the crap code and useless
> dependencies we can find. We didn't stop all development to do
> that. We have occasional periods of relative peace and quiet where
> we are more or less free to go crazy on the codebase and tear it
> apart, but most of the time we just do it all by defaulting to a sort
> of "review and refactor" mode every chance we get: when things
> build, over lunch, during team bug fixing sessions, when Friday
> afternoons get drowsy...
>
> We did have a few big construction sites... Switching our build
> system from a giant Ant build of more than 8,500 lines of code to a
> multi-module Maven build was one of them. We now have clear-cut
> modules (or at least it's already a lot better than before, and we
> still have big plans for the future), automatic dependency
> management (which allows for easy maintenance and updates, and
> allowed us to remove lots of them), and faster builds that are easier
> to get started with, to reproduce on demand, and to integrate
> with code quality tools.
>
> Injecting some "utility tool-belts" into the codebase, even though
> we were trying to reduce dependencies, was another: Google Guava and
> Apache Commons can help your code slim down to a much smaller size,
> and greatly reduce the surface for bugs in **your** code.
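> To give a (contrived, not-from-our-codebase) taste of what that
> slimming looks like, compare a hand-rolled, null-aware string join
> with the Guava one-liner that replaces it:
>
> ```java
> import com.google.common.base.Joiner;
>
> import java.util.Arrays;
> import java.util.List;
>
> public class SlimDown {
>     public static void main(String[] args) {
>         List<String> names = Arrays.asList("alice", null, "bob");
>
>         // Before: hand-rolled joining with manual null handling - bug-prone boilerplate.
>         StringBuilder sb = new StringBuilder();
>         for (String name : names) {
>             if (name == null) continue;
>             if (sb.length() > 0) sb.append(", ");
>             sb.append(name);
>         }
>         String before = sb.toString();
>
>         // After: the same intent, stated once.
>         String after = Joiner.on(", ").skipNulls().join(names);
>
>         System.out.println(before.equals(after)); // true
>     }
> }
> ```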
> Persuading our IT department that the tools we use today (JIRA,
> Fisheye, Crucible, Confluence, Jenkins) might be better than the
> ones in place was yet another. We still need to deal with a few
> tools we despise (I'm looking at you, QC, Sharepoint and
> SupportWorks), but it's still been a huge improvement, and we
> believe there's still room for more.
>
> And every day, there's a trickle of anywhere from one to dozens of
> commits that deal only with fixing and refactoring things. We do
> occasionally break stuff (remember kids: you need unit tests, and
> you'd better write them **before** you refactor stuff away), but
> overall the benefit for our morale AND for the product has been
> enormous. We get there one fraction of a code quality percentage at
> a time. **And it's fun to see it increase!!!**

_It's indeed important to note that every once in a while, rigidity (here, that of our IT department and of other development teams in the company) needs to be shaken to make room for new and better things. But you need to prove that they are indeed better and will boost your productivity. That's what trial runs and prototypes are for._

## The Iterative Spaghetti Code Refactoring List

Once you have some quality tools in your toolbelt:

1. Run the checkers.
2. Identify the hotspots.
3. Fix critical hotspots and violations first.
4. Fix minor violations whose fixes can be automated, in one large sweep. <sup>(It reduces noise, so you are able to see significant violations when they appear on the radar.)</sup>
5. Go back to 1 and repeat until you're satisfied with your code. <sup>(Which, ideally, you should never be, if this is still an active product.)</sup>

[1]: http://en.wikipedia.org/wiki/Continuous_integration
[2]: http://www.ibm.com/developerworks/java/library/j-ap08016/index.html
[3]: http://www.amazon.com/Continuous-Integration-Improving-Software-Signature/dp/0321336380
[4]: http://en.wikipedia.org/wiki/Test-driven_development
[5]: http://en.wikipedia.org/wiki/Integration_testing
[6]: http://en.wikipedia.org/wiki/Regression_testing
[7]: http://en.wikipedia.org/wiki/Issue_tracking_system
[8]: http://en.wikipedia.org/wiki/Bug_tracking_system
[9]: http://www.codinghorror.com/blog/2009/05/how-to-motivate-programmers.html
[10]: http://programmers.stackexchange.com/questions/83038/who-is-responsible-for-defects-found-during-development
[11]: http://www.codinghorror.com/blog/2006/05/egoless-programming-you-are-not-your-job.html
[12]: http://www.sonarsource.org/