At a more fundamental level, voice recognition software takes speech and breaks it into recognizable phonemes (i.e., the specific chunks of sound that make up parts of syllables), and then puts them back together into syllables and words.

But you should be able to make a good guess as to what language is being spoken just by taking the raw phonemes and doing a frequency analysis on them. That, combined with some basic analysis of rhythmic patterns, is probably what you are doing when you can tell that someone you aren't listening to closely sounds like they are talking in German, Italian, etc.
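
To make the frequency-analysis idea concrete, here is a rough sketch in Python. It assumes you already have the phonemes as a list of symbols (getting them out of audio is the hard part and is not shown), and it compares a sample's phoneme frequencies against one reference profile per language using cosine similarity. The function names, the "lang_a"/"lang_b" labels, and the tiny training sequences are all made up for illustration; they are not real phoneme statistics for any language, and a real system would likely use far more data and also the rhythm cues mentioned above.

```python
from collections import Counter
import math


def profile(phonemes):
    """Turn a sequence of phoneme symbols into relative frequencies."""
    counts = Counter(phonemes)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * b.get(p, 0.0) for p, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(phonemes, reference_profiles):
    """Return the label whose reference profile is closest to the sample."""
    sample = profile(phonemes)
    return max(reference_profiles,
               key=lambda lang: cosine(sample, reference_profiles[lang]))


# Invented toy data (an assumption for the example, not real measurements).
TOY_TRAINING = {
    "lang_a": ["a", "t", "a", "k", "a", "s"],
    "lang_b": ["o", "m", "o", "n", "o", "l"],
}

if __name__ == "__main__":
    profiles = {lang: profile(seq) for lang, seq in TOY_TRAINING.items()}
    print(guess_language(["a", "t", "a"], profiles))
```

With realistic reference profiles built from lots of transcribed speech, the same comparison could be run on a few seconds of recognized phonemes, without ever assembling them into words.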
