Sunday, November 28, 2010

Top 10 improvements for PHP developers

Craig Buckler's post over on sitepoint.com : Top 10 MySQL Mistakes Made By PHP Developers got me thinking. Are they really mistakes, and are the answers giving people all the options, or is there more that can be added?

So this post is and extension and part reply to that blog post.

Maybe the 10 points are just things we can improve on, in various ways depending on what it is we're doing. Or is it just shifting processing & logic from one area to another.

My caveat would be - that if every beginning PHP dev read the 10 points in Craig's blog and learnt; the dev world would be a better place. I'm thinking beyond the original points.

I'll just mirror the 10 points he makes, to give my 2c to the discussion :


1. Using MyISAM rather than InnoDB

I mostly agree with the premise of "Oh just use InnoDB", given what Facebook are doing with it, and the level of active development going on, it's a safe bet at the moment.

Though technically it's not a database engine, that's MySQL, it's a storage engine that's being talked about, and there a few of them, not just two.

There are a few out there. Just reading though a list to see what they are built for, and why, has to be a good idea (even if you just default back to InnoDB).

Reading about the current developments and options out there is always a good idea.

Given your designing a data store, how are you going to use the data? What is the use pattern, is it just an archive or an in memory look up, or something in between?

Why use just one? Can you log to one database using one storage engine and use the other for live data (data relations allowing).

Also given that Oracle owns MySQL now, is what you're doing compatible with other options like MariaDB if you have to move ?

What about using XtraDB or PBXT ?

Do you need full text search (supported in MyISAM) or can you use Sphinx or Lucene ?

Sure, for ~95% of people (including most of the things I'm doing now) InnoDB is fine, though keeping up with what is going on and making informed choices about technology is no bad thing.

Remember the main reasons for InnoDB over MyISAM, is row level locking and transactions. Choose data integrity over speed, and the speed argument has mostly gone away now anyway.

2. Using PHP's MySQL functions

I think this is assuming that you're putting SQL directly into your logic, i.e. Controller in the world of MVC. Yes, you should use the mysqli functions and weigh the costs of PDO and prepared statements

This also comes back to the idea of "where is the logic" in this  system, which I'll really go into in point 5.

Suffice to say, if you have PHP on one level (business logic) producing functionality, and MySQL on another level (data store) providing more functionality; can they both scale to meet each others needs ?

You can tune MySQL to some insane numbers, especially when you're on version 5.5 with lots of cores and against SSD like Fusion-I/O ...... though you can keep putting in web servers till you saturate the network with PHP.

Also, this all predisposes, you're running against one database, or even a master-slave pair for some kind of redundancy, or read-write balancing.

If we put aside the scaling of databases issue for a moment, and just be mindful of :
  • web servers running well encapsulated code = cheap & easy
  • scaling databases = not so cheap & a pain
The more you can get the dumb web servers to do, and take load of the database, the better. Just have a load balanced web farm and chuck more servers in it, or "use the cloud" etc.


3. Not sanitizing user input

Agree 100%, a no-brainer, and it should be point #1. When we built vBulletin 3 we enforced a single point of access for all user data, the $vbulletin->GPC array.

A single point of getting the incoming data, which was cleaned and checked, the gatekeeper as it were.

Personally I assume every piece of data coming in is malicious and the user is trying to break the system ........ does wonders for the stability of the code I write, though not so much for my trust in people and faith in humanity ........


4. Not using UTF-8

Again agree 100%.

Two things I would add are :

1) Ensuring you know what UTF-8 is, and why it's a good thing

http://en.wikipedia.org/wiki/UTF-8

2) The difference between the collation types of tables and their uses

http://stackoverflow.com/questions/2344118/utf-8-general-bin-unicode


5. Favouring PHP over SQL

Now this is where I disagree ......... a lot.

But not with the "don't do SQL in loops, and write good SQL" that's another no-brainer.

I cringe when I see SQL loops .... as I did on a recent project,it must be the digital equivalent of S&M with MySQL in a gimp suit and PHP with a riding crop ...... though I digress.


I'll use two (of many) examples, as to why "PHP over SQL".

1) Architecture, scaling and system design.

Firstly you are splitting the logic between SQL and PHP as to what is making the decisions with your data, what happens if you have to move away from SQL?

I'm not siding with the NoSQL gang and just looking for excuses, I'm ensuring that I'm giving myself choices and using a database as a data store first.

You can doing averaging in PHP near as well as you can in MySQL AVG() with good data structures.

Also you can point a fair few of web servers at one default set up database, but you can point a LOT of web servers at a very well tuned database when the PHP is doing more of the grunt work.

So more over all "work" is done, and it gives you more room before MySQL bottlenecks.

2) Using time - NOW().

When you have more machines and the system grows, PHP and MySQL are going to be using different system clocks for time() and NOW(). If any of these get out of sync, which time is correct ?

What happens if the data store (for writes) is on the end of a queue, i.e. JMS or WebSphere MQ, the time the data is processed is important, not when it's written.

Also if there are multiples of the same data with different time stamps, you can process them in order or enforce a kind of vector locking if needed.

Personally I usually store times as epoch and do any time calculations in PHP to make it easier, selecting on INT opposed to DATE. The actual year, month, day can be denormalised into columns if really necessary.

(An upcoming blog post about reporting on a shoe string does that)


6. Not optimizing your queries

Yes another good one, though a lot of over lap with #5.

Having very complicated SQL also supports my argument from #5, just don't do it.

Get the data in and out as fast and cleanly as possible, and do the rest in PHP.

If you're tying up your database with queries because you've tried to be "too clever" or were "showing off with your l33t SQL skillz" then next time have a think about clean software architecture, and what is workable and maintainable.

A short while ago I worked on a project with the worst SQL I've ever seen, even. Even an ex MySQL-AB consultant called it "A master piece ....... but from the dark side, they couldn't of made this worse if they tried; I don't think I could".

One of my favourite quotes.

I guess the original developers thought they were being rather cool and snazzy at the time, or more likely that it was a road crash of a development journey.

Though if it wasn't for some late nights be me and the gang, and throwing a lot of hardware at the problem (while we re-engineered it on the fly), it would of ended the project.

So bad SQL can be really bad.

15 way JOINs and 10 inner SELECTs means you have something wrong, not that you're doing well .........


7. Using the wrong data types

The opposite of what is said in the original supports what I'm saying, yes make MySQL a dumb data store if that is easier.


There are temptations to store INT for epoch as well as DATE or denormalized day, month, year INTs, though unless you're sure the data never changes or isn't supposed to, be very wary of this.

Again it's about where are you performing the logic. You can put a lot of the logic of what you assume or would of done in SQL, into a decent data layer written in PHP itself. i.e the Model in MVC.

Also writing up a decent data layer (access layer, abstraction, etc) or using an existing one, adds the ability to put in caching.

In a recent game back end I wrote, there are 9 SQL selects when the game stars and 2 each time a player turns up, and that's it. After that it's all INSERT and UPDATE, because of memcache.

Baring memcache going down (then the data layer just fails over to the database ..... and you're back to the PHP on MySQL S&M abuse scenario ! .......) or a restart all reads should be dealt with by cache.

Even if your data is on SSD not disk, there is no need for it (for reads that is, I understand the need for persistence in storage !).


8. Using * in SELECT queries

Very much so, I'd fire people for less .........


9. Under or over Indexing

This very much is relative to the app and how the data is used, though yes it is a balancing act. This takes time and reviewing the data use pattern, slow query log etc.

It can also be something of a dark art, and you can take a lot of time reading up on key buffer sizes and how they effect your query speed.

Though one good thing to think about for general learning, is index only scans.

Here is some reading from the world of PostgreSQL  :

http://rhaas.blogspot.com/2010/11/index-only-scans.html

Talk from the MySQL world :

http://mysqlha.blogspot.com/2010/11/how-are-index-only-scans-implemented-in.html


10. Forgetting to backup

Once again I disagree ....... not with backups, they are vital. Because they are needed for the one thing that I do advocate that is what is really important ........ proven restores!

--------------

Another top 10 list for programmers, that I like :


http://www.techradar.com/news/software/applications/10-mistakes-every-programmer-makes-909424

 --------------

As ever take care and think of what you're doing and why, not only how.

@JeremyHutchings

2 comments:

  1. You promised us 10 programmers you like!

    ReplyDelete
  2. Well another top 10 list *for* programmers ;)

    ReplyDelete