1) Well, first of all let me mention that the problem is complex. It requires an understanding of the Hidden Markov Models, beam search, language modelling and every other technology involved. I really recommend you to read the book on speech recognition first. This one is very good:
In theory it will be possible to build a speech recognition system without any extensive knowledge as listed above, but it's not the case now. We are working on easy to use system but we are still in the early beginning.
Probably you don't have time to read and study all this. Then think if you have time to implement speech recognition system at all. At least learn basic concepts.
Please also learn how to do programming before you are starting. I think it's obvious you need to have software development experience. We can't teach you Java really.
2) Next step is to setup the basic example of the system and estimate it's accuracy while first part is quite obvious, second is often ignored. Don't do that. It's critical to test accuracy on realistic conditions during whole development process.
Decide what kind of system are you going to implement - large vocabulary dictation, medium vocabulary names recognition or small vocabulary command and control or some other task. Probably you need IVR system. For each system there is a demo already. Use it as a base. Please don't try to build dictation from a command and control demo, it's just not suitable.
Now the important task, the estimation of the accuracy. The examples how to do that could be found in decoder sources, in tutorial and in many more places. Collect test database, recognize it and compute the exact accuracy estimation.
It should be the following:
Command and control 5% WER (word error rate)
Medium vocabulary 15% WER
Large vocabulary 30% WER
Large vocabulary short utterances 50% WER
if you have noisy audio or some accent multiply this number on 2.
4) Compare the actual accuracy with the expected value. If the accuracy is mostly the expected, proceed to the next step. If not, search for the bug. Most likely you've made a mistake in system setup. Check the configuration if it is suitable for the task, check sound speech quality with sound editing, find out the spectrum range, check the sampling rate, accent, dictionary and other parameters. If your WER is 90%, you made a mistake for sure.
For example of task-dependent training consider acoustic database. If you train a small vocabulary acoustic model you need word-based acoustic model with word-based phoneset in Sphinxtrain case. If you are training large vocabulary database make sure your phoneset is not large and make sure you have selected the proper number of senones/mixtures.
5) Once you've reached a baseline, it will take a lot to improve it. Think if it's enough for you and if you can build your application with such accuracy. It's unlikely you'll get significantly more. But if you are brave enough:
- Use MLLT/VTLN feature space adaptation
- Use MLLR and other type of online speaker adaptation
- Adapt language model, use context-sensetive language models
- Tune beams - try different values and experiment with them
- Implement the rejection for OOV (out-of-vocabulary) words and other noise sounds
- Implement noise cancellation.
- Adapt acoustic models and dictionary to your speakers/their accent.
If you are short of ideas here, join the mailing list, we have a lot of features to implement.