What to do right from the beginning
I was talking with my coworker Paul today about what we would do differently from the beginning of a big software project. The platform that we work on is great and getting better every day, and some of these decisions were made right from the start, but some weren’t, and we’ve had to suffer through fixing them.
So, what’s important from the very start? Most of these items are applicable to any large software project, but some of them are specific to large web sites that run on PHP. In no particular order:
- Internationalization. If your project is successful at all, someone will eventually want to use it in a language other than English. This will probably happen sooner than you think, because even though you’ve never met them, there are hundreds of millions of Chinese people with internet access. This means that every bit of English text you put into your application needs to have hooks built in to translate it. Thankfully, the foundation to do this right is there, with GNU gettext and the locale setting in the environment, but you have to be aware of this from the beginning, even if you have no plans to internationalize. It costs almost nothing to put those hooks in now, just in case.
- UTF-8. If you don’t use UTF-8, you make it so that no one can post content to your site in any language other than one that uses Latin characters, and you’ll eventually be forced into a messy, annoying, never-quite-done transition from ISO-8859 to UTF-8. If you use UTF-8 from the beginning, you’ll never really think about the fact that user 375209 posts his blogs mostly in Korean but sometimes in Thai. The fact that pretty much everything out there defaults to 8859 is like a bad joke. Do you not understand what I’m talking about? Read this.
- HTML filtering (XSS prevention). If you are making a site that allows non-authenticated users to post text that other users will see (which is just about every dynamic web site ever), you have to do complex, difficult, slow filtering on every last bit of textual user input. If you don’t, a malicious user can redirect anyone who would have seen their text to a site of their choosing. And yes, in real life, people do this, usually with redirects to porn. It happened to one of my clients this week (not my fault). This is so difficult to get right that even huge sites like MySpace get it wrong regularly. Again, there are some great open source libraries out there, such as HTML Purifier. And don’t think you can get away with filtering out all HTML; it’s the future, and people want to post rich text. When the first user posts the first porn redirect on your front page, you need a solution you can implement in a few hours, and if you haven’t done filtering at all, that’s just not possible. Don’t use a BBCode-style solution; it’s not good for a variety of reasons.
- Model-view-controller. We don’t use a particular framework for this, but it’s the philosophy that’s important. In any request, there should be a controller that reads user input, a model that does data transformation/processing, and a view that controls the format of the output sent to the user. We use Smarty for the view. Don’t make your own templating system or use simple PHP files; use Smarty. With Smarty, you aren’t tempted to do complex stuff in the template.
- Code standards. Decide on spaces or tabs (important for source control). Use PHPDoc religiously. Decide on camel case or underscores. Don’t use functions (well, sometimes), use objects. Write these decisions down.
- Scalability. Maybe you’ll be successful and you’ll get to the point where it all can’t run on a single rented box, and you’ll start to have to worry about performance. At that point, you need to start caching your models (this is one of the reasons you need MVC). We use memcache. One of the things you can do now is keep all methods that change data in the same place. Later you can put code there to kill the cache entry for this object so it can be regenerated with new data.
- Don’t worry about performance. The number of CPU cycles you take up processing your text is no longer important, up to a pretty absurd point, in today’s world of cheap, fast, parallel web servers. Your DB server is what’s important, since it’s pretty difficult to have more than one of them, and extraordinarily difficult to have more than one that you can write to. And if you notice someday that one particular page is really slow, well, then cache it.
- Database abstraction. Do not write SQL queries in your code. This is probably the most controversial item in this list, but I stand by it. Never write SQL queries. Generate them programmatically. We use DB_DataObject. Today, I’d probably use MDB_QueryTool, but they’re basically equivalent. It can be a pain, especially when you’re talking about multi-table joins, but this is also where it’s most useful. A big query with a bunch of variables which could be filled with WHERE clauses and joins (but maybe not, so make sure to put an AND at the beginning of your additional condition unless you’re the first condition, and spaces at the beginning and end of all variables, just to be sure!) is impossible to maintain, and the business rules in it are easy to overlook. SQL is a 30-year-old language and sucks hard in some pretty basic ways; let a library deal with the pain for you.
- No stored procedures. No triggers. No views. No foreign keys. The world has moved on from the days of the all-powerful DBMS whose status at any moment can be relied upon by a court of law. Your database should just be tables that you can add to and delete from. Simple. There’s nothing you can do in a stored procedure that you can’t do in your main code, but there are plenty of things you can’t do from a stored procedure.
- Security. Security is important. Write every piece of code with an eye toward who the user is and whether they can do this. This probably means roles, etc. It’s a pain in the ass. It’s important. Do periodic security audits.
- SQL injection. SQL injection prevention is important and hard to get right every time, so you need to do it systematically. Here, more than any other security concern, is where one bad line will totally screw you.
- Standardized HTML/CSS. This is our next big project, since we didn’t do this one right at all. I don’t know the right answer to this one yet, but I will soon. YUI?
- URL rewriting. As strange as this seems, people really do care about what the URLs of pages look like, even if they’re never going to use them. We use Net_URL_Mapper, but I’m not sure what the best answer is here either.
- No code generation. Also a controversial statement. My experience with code generation has been universally negative. My primary argument is that changes in the code that transforms your configuration into your implementation should have real-time effects, not effects that show up months later when you want to change the configuration but the regeneration you just did broke something else because configurations like this were phased out six months ago, except for this one… you get the idea. If the performance of this transformation is too slow for real-time, then just cache it forever in memcache, but don’t keep it around on the disk.
- All warnings on in development, all fatal errors logged and examined. If you do this, you will increase your software reliability significantly.
- No configuration in the database. There are many reasons for this, but the best one is that it’s pretty much impossible to usefully version control the contents of database tables. Configuration should go in files that are version controlled. We use JSON. Simple, human editable, machine editable.
- Open source. Use all open source software. Do you really trust your business to the whims of some other company? What if there’s some bug in their software that’s really screwing you up, but you can’t fix it? There’s already plenty about web development you can’t control; don’t add to it by voluntarily putting closed software in your architecture, especially when there’s such good stuff out there.
- Database layer isolation. That’s not a good phrase for what I’m talking about, but I’m not sure what the industry word is for this. To put it simply, don’t have queries for a specific table all over your code; call one piece of code that queries that table. What if you want to add a hidden flag to that table? Then you’ve got to go add a where clause to all those queries, and you’ll miss one, and then the hidden thing will show up on the front page, and people will be pissed.
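The database abstraction, SQL injection, and database layer isolation items fit together, so here is one rough sketch covering all three. It is not how DB_DataObject or MDB_QueryTool work internally; it is a toy in Python with SQLite, and `build_select`, `PostTable`, and the `posts` table are hypothetical names. The query text is generated, user values only ever travel as bound parameters, and the hidden flag is enforced in exactly one place.

```python
import sqlite3


def build_select(table, columns, conditions):
    """Return (sql, params) for a SELECT with optional equality conditions.

    Table and column names must come from code, never from user input;
    only the condition values are passed as bound parameters.
    """
    sql = "SELECT {} FROM {}".format(", ".join(columns), table)
    if conditions:
        # The builder joins the conditions, so nobody hand-places an AND.
        where = " AND ".join("{} = ?".format(col) for col in conditions)
        sql += " WHERE " + where
    return sql, list(conditions.values())


class PostTable:
    """The single piece of code that queries the (hypothetical) posts table."""

    def __init__(self, conn):
        self.conn = conn

    def find(self, **conditions):
        # The hidden flag is applied here once, so no caller can forget it.
        conditions["hidden"] = 0
        sql, params = build_select("posts", ["id", "author", "title"], conditions)
        return self.conn.execute(sql, params).fetchall()


# Demo setup only: an in-memory database with three rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, title TEXT, hidden INTEGER)"
)
conn.executemany(
    "INSERT INTO posts (author, title, hidden) VALUES (?, ?, ?)",
    [("paul", "First", 0), ("paul", "Secret", 1), ("scott", "Hello", 0)],
)

posts = PostTable(conn)
print(posts.find(author="paul"))
# A classic injection string is treated as a plain value and matches nothing.
print(posts.find(author="paul' OR '1'='1"))
```

A real library also handles joins, ordering, and identifier quoting; the point here is only that the systematic part lives in one function instead of in every query.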
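For the scalability item, this is roughly what “keep all methods that change data in the same place” buys you later. It is a sketch under assumptions: `DictCache` is a stand-in for a memcache client (a plain dict, so the example runs anywhere), the `storage` dict stands in for the database, and `UserModel` is a hypothetical model class.

```python
class DictCache:
    """Stand-in for a memcache client; a dict keeps the sketch runnable."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class UserModel:
    def __init__(self, storage, cache):
        self.storage = storage  # dict standing in for the database
        self.cache = cache
        self.loads = 0  # counts trips to the "database", for the demo

    def _key(self, user_id):
        return "user:{}".format(user_id)

    def get(self, user_id):
        cached = self.cache.get(self._key(user_id))
        if cached is not None:
            return cached
        self.loads += 1
        user = dict(self.storage[user_id])
        self.cache.set(self._key(user_id), user)
        return user

    def rename(self, user_id, new_name):
        # Every write goes through the model, so the cache entry is
        # killed here and regenerated on the next read.
        self.storage[user_id]["name"] = new_name
        self.cache.delete(self._key(user_id))


users = UserModel({375209: {"name": "old name"}}, DictCache())
users.get(375209)
users.get(375209)          # served from the cache, no second load
users.rename(375209, "new name")
print(users.get(375209)["name"], users.loads)
```

If writes were scattered around the codebase, each one would need its own invalidation call, and the one you miss is the stale page a user reports.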
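The configuration item is simple enough that a few lines show the whole idea. This is a sketch, not our actual setup: the file name `config.json`, the keys inside it, and `load_config` are invented for the example, and the file is written to a temporary directory here only so the snippet is self-contained. In practice the file would live in the repository.

```python
import json
import os
import tempfile


def load_config(path):
    """Read a JSON configuration file and return it as a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demo only: create a sample file. Sorted keys and fixed indentation keep
# machine-written edits stable, so version control diffs stay readable.
sample = {"site_name": "Example", "cache": {"enabled": True, "ttl_seconds": 300}}
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(sample, handle, indent=2, sort_keys=True)

config = load_config(path)
print(config["cache"]["ttl_seconds"])
```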
This is just what’s important to do right from the beginning. There’s a whole other set of processes that are just as important, but they’re the kind of thing you can adopt as you grow. That’s for next time.
About this entry
You’re currently reading “What to do right from the beginning,” an entry on Scott Martin
- Published:
- 12.14.07 / 10pm