Forced download of the Russian package to the phone. Ultra-fast speech recognition without servers using a real example

13.11.2020

Products and technologies:

Visual Studio, C#, .NET Speech Libraries

The article discusses:

  • adding support for speech recognition to the console application;
  • processing of recognized speech;
  • installation of libraries for speech recognition;
  • comparison of Microsoft.Speech and System.Speech;
  • adding speech recognition support to a Windows Forms application.

With the introduction of Cortana on Windows Phone, a speech-activated personal assistant (as well as its fruit-company counterpart that shall not be named in vain), speech-enabled apps have become increasingly prominent in software development. In this article, I'll show you how to get started with speech recognition and speech synthesis in Windows console applications, Windows Forms applications, and Windows Presentation Foundation (WPF) applications.

Note that you can also add speech capabilities to Windows Phone apps, ASP.NET web apps, Windows Store apps, Windows RT, and Xbox Kinect, but the techniques are different from those discussed in this article.

A good way to understand what exactly this article will discuss is to look at the screenshots of two different demos in fig. 1 and 2. After launching, the console application in fig. 1 immediately says the phrase "I am awake". Of course, you won't be able to hear the demo application while reading this article, so it also displays the text of what the computer says. Then the user says the command "Speech on". The demo application echoes the recognized text, then internally listens for, and responds to, requests to add two numbers.

Figure: 1. Recognition and synthesis of speech in a console application


Figure: 2. Speech Recognition in Windows Forms Application

The user asked the app to add one and two, then two and three. The application recognized the spoken commands and gave the answers by voice. I will describe more useful ways to use speech recognition later.

Then the user said "Speech off" - a voice command that deactivates listening to commands for adding numbers, but does not completely disable speech recognition. After this speech command, the next addition of 1s and 2s was ignored. Finally, the user turned on listening for commands again and uttered a meaningless command "Klatu barada nikto", which the application recognized as a command to completely deactivate speech recognition and terminate itself.

Fig. 2 shows a speech-enabled Windows Forms demo application. This application recognizes spoken commands but does not respond to them with voice output. When the application was first launched, the Speech On checkbox was not checked, indicating that speech recognition was not active. The user checked the box and then said "Hello". The application responded by displaying the recognized text in the ListBox control at the bottom of the window.

Then the user said, "Set text box 1 to red". The application recognized the speech and responded "set text box 1 red", which is almost (but not quite) exactly what the user said. Although you can't see it in fig. 2, the text in the TextBox at the top of the window really is red.

Then the user said: "Please set text box 1 to white". The application recognized it as "set text box 1 white" and did just that. Finally, the user said, "Good-bye," and the application displayed this text, but did nothing with Windows Forms, although it could, for example, clear the Speech On checkbox.

In the following sections, I will walk you through the process of creating both demos, including installing the required .NET speech libraries. This article assumes that you have at least intermediate programming skills, but know nothing about speech recognition and synthesis.

Adding Speech Recognition Support to a Console Application

To create the demo shown in fig. 1, I started Visual Studio and created a new C# console application called ConsoleSpeech. I've used the speech libraries successfully with Visual Studio 2010 and 2012, but any relatively recent version should work. After loading the template code into the editor, I renamed the Program.cs file in the Solution Explorer window to the more descriptive ConsoleSpeechProgram.cs, and Visual Studio renamed the Program class for me.

Next, I added a reference to the Microsoft.Speech.dll file, which is located at C:\Program Files (x86)\Microsoft SDKs\Speech\v11.0\Assembly. This DLL was missing from my computer and I had to download it. Installing the files required to add speech recognition and synthesis to an application is not entirely trivial. I'll explain the installation process in detail in the next section; for now, assume Microsoft.Speech.dll is on your system.

After adding the reference to the speech DLL, I removed all using statements from the top of the code except for the one for the top-level System namespace. Then I added using statements for the Microsoft.Speech.Recognition, Microsoft.Speech.Synthesis, and System.Globalization namespaces. The first two correspond to the speech DLL. Note that there are also namespaces named System.Speech.Recognition and System.Speech.Synthesis, which can be confusing; I'll explain the difference shortly. The Globalization namespace was available by default and did not require adding a new reference to the project.
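For reference, the resulting using directives at the top of ConsoleSpeechProgram.cs are:

using System;
using Microsoft.Speech.Recognition;
using Microsoft.Speech.Synthesis;
using System.Globalization;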

The entire source code for the demo console application is shown in fig. 3 and is also available in the source package accompanying this article. I removed all standard error handling to keep the main ideas as clear as possible.

Figure: 3. Source code of the demo console application

using System;
using Microsoft.Speech.Recognition;
using Microsoft.Speech.Synthesis;
using System.Globalization;

namespace ConsoleSpeech
{
  class ConsoleSpeechProgram
  {
    static SpeechSynthesizer ss = new SpeechSynthesizer();
    static SpeechRecognitionEngine sre;
    static bool done = false;
    static bool speechOn = true;

    static void Main(string[] args)
    {
      try
      {
        ss.SetOutputToDefaultAudioDevice();
        Console.WriteLine("\n(Speaking: I am awake)");
        ss.Speak("I am awake");

        CultureInfo ci = new CultureInfo("en-us");
        sre = new SpeechRecognitionEngine(ci);
        sre.SetInputToDefaultAudioDevice();
        sre.SpeechRecognized += sre_SpeechRecognized;

        Choices ch_StartStopCommands = new Choices();
        ch_StartStopCommands.Add("speech on");
        ch_StartStopCommands.Add("speech off");
        ch_StartStopCommands.Add("klatu barada nikto");
        GrammarBuilder gb_StartStop = new GrammarBuilder();
        gb_StartStop.Append(ch_StartStopCommands);
        Grammar g_StartStop = new Grammar(gb_StartStop);

        Choices ch_Numbers = new Choices();
        ch_Numbers.Add("1");
        ch_Numbers.Add("2");
        ch_Numbers.Add("3");
        ch_Numbers.Add("4");
        GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
        gb_WhatIsXplusY.Append("What is");
        gb_WhatIsXplusY.Append(ch_Numbers);
        gb_WhatIsXplusY.Append("plus");
        gb_WhatIsXplusY.Append(ch_Numbers);
        Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);

        sre.LoadGrammarAsync(g_StartStop);
        sre.LoadGrammarAsync(g_WhatIsXplusY);
        sre.RecognizeAsync(RecognizeMode.Multiple);

        while (done == false) { ; }

        Console.WriteLine("\nHit <enter> to close shell\n");
        Console.ReadLine();
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
        Console.ReadLine();
      }
    } // Main

    static void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
      string txt = e.Result.Text;
      float confidence = e.Result.Confidence;
      Console.WriteLine("\nRecognized: " + txt);
      if (confidence < 0.60) return;

      if (txt.IndexOf("speech on") >= 0)
      {
        Console.WriteLine("Speech is now ON");
        speechOn = true;
      }
      if (txt.IndexOf("speech off") >= 0)
      {
        Console.WriteLine("Speech is now OFF");
        speechOn = false;
      }
      if (speechOn == false) return;

      if (txt.IndexOf("klatu") >= 0 && txt.IndexOf("barada") >= 0)
      {
        ((SpeechRecognitionEngine)sender).RecognizeAsyncCancel();
        done = true;
        Console.WriteLine("(Speaking: Farewell)");
        ss.Speak("Farewell");
      }

      if (txt.IndexOf("What") >= 0 && txt.IndexOf("plus") >= 0)
      {
        string[] words = txt.Split(' ');
        int num1 = int.Parse(words[2]);
        int num2 = int.Parse(words[4]);
        int sum = num1 + num2;
        Console.WriteLine("(Speaking: " + words[2] + " plus " + words[4] +
          " equals " + sum + ")");
        ss.SpeakAsync(words[2] + " plus " + words[4] + " equals " + sum);
      }
    } // sre_SpeechRecognized

  } // Program
} // ns

After the using statements, the demo code starts like this:

namespace ConsoleSpeech
{
  class ConsoleSpeechProgram
  {
    static SpeechSynthesizer ss = new SpeechSynthesizer();
    static SpeechRecognitionEngine sre;
    static bool done = false;
    static bool speechOn = true;

    static void Main(string[] args)
    {
      ...

The SpeechSynthesizer class-level object enables an application to synthesize speech. The SpeechRecognitionEngine object allows an application to listen to and recognize spoken words or phrases. The boolean done determines when the entire application ends. The boolean variable speechOn controls whether the application listens for any commands other than the command to exit the program.

The idea here is that the console application does not accept keyboard input, so it always listens for commands. However, if speechOn is false, only the command to exit the program is recognized and executed; other commands are recognized but ignored.

The Main method starts like this:

try
{
  ss.SetOutputToDefaultAudioDevice();
  Console.WriteLine("\n(Speaking: I am awake)");
  ss.Speak("I am awake");

The SpeechSynthesizer object was instantiated when it was declared. Using the synthesizer object is pretty straightforward. The SetOutputToDefaultAudioDevice method sends output to speakers connected to your computer (output can also be sent to a file). The Speak method takes a string and then speaks it. That's how easy it is.

Speech recognition is much more difficult than speech synthesis. The Main method continues by creating a recognizer object:

CultureInfo ci = new CultureInfo("en-us");
sre = new SpeechRecognitionEngine(ci);
sre.SetInputToDefaultAudioDevice();
sre.SpeechRecognized += sre_SpeechRecognized;

First, the CultureInfo object specifies the language to recognize, in this case United States English. The CultureInfo object is in the Globalization namespace that we referenced with the using statement. Then, after calling the SpeechRecognitionEngine constructor, the voice input is assigned to the default audio device — most often the microphone. Note that most laptops have a built-in microphone, but desktop computers will require an external microphone (this is often combined with headphones these days).

The key method for the recognizer object is the SpeechRecognized event handler. In Visual Studio, if you type "sre.SpeechRecognized +=" and wait a split second, IntelliSense will automatically complete the expression with the event handler name sre_SpeechRecognized. I advise you to press the Tab key to accept the suggestion and use this default name. Next, the demo creates the Choices, GrammarBuilder, and Grammar objects for the addition phrases:

Choices ch_Numbers = new Choices();
ch_Numbers.Add("1");
ch_Numbers.Add("2");
ch_Numbers.Add("3");
ch_Numbers.Add("4"); // technically, this is Add(new string[] { "4" });
GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
gb_WhatIsXplusY.Append("What is");
gb_WhatIsXplusY.Append(ch_Numbers);
gb_WhatIsXplusY.Append("plus");
gb_WhatIsXplusY.Append(ch_Numbers);
Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);

The three main objects here are a Choices set, a GrammarBuilder template, and the controlling Grammar. When I create a grammar for recognition, I start by listing some specific examples of what needs to be recognized, say, "What is one plus two?" and "What is three plus four?".

Then I define an appropriate generic template, for example, "What is <x> plus <y>". The template is the GrammarBuilder, and the specific values that fill it are the Choices set. The Grammar object encapsulates the template and the Choices.

In the demo program, I limit the addends to 1 through 4 and add them as strings to the Choices set. A more efficient approach:

string[] numbers = new string[] { "1", "2", "3", "4" };
Choices ch_Numbers = new Choices(numbers);

I presented the less efficient approach to creating a Choices set for two reasons. First, adding one string at a time was the only approach I had seen in other speech recognition examples. Second, you might think that adding one string at a time shouldn't work at all; Visual Studio IntelliSense shows in real time that one of the Add overloads takes a params string[] phrases parameter. If you didn't notice the params keyword, you might have thought the Add method accepts only arrays of strings and not a single string. But that's not so: it accepts both. I recommend passing an array.
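As a quick illustration of the two styles (assuming the ch_Numbers object from the demo), both of these calls compile, because Add is declared with a params string[] parameter:

ch_Numbers.Add("3");                       // a single string, wrapped by params
ch_Numbers.Add(new string[] { "3", "4" }); // an explicit string array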

Creating a Choices set from sequential numbers is somewhat of a special case and allows a programmatic approach like this:

string[] numbers = new string[100];
for (int i = 0; i < 100; ++i)
  numbers[i] = i.ToString();
Choices ch_Numbers = new Choices(numbers);

After creating the Choices to fill the slots of the GrammarBuilder, the demo creates the GrammarBuilder and then the controlling Grammar:

GrammarBuilder gb_WhatIsXplusY = new GrammarBuilder();
gb_WhatIsXplusY.Append("What is");
gb_WhatIsXplusY.Append(ch_Numbers);
gb_WhatIsXplusY.Append("plus");
gb_WhatIsXplusY.Append(ch_Numbers);
Grammar g_WhatIsXplusY = new Grammar(gb_WhatIsXplusY);

The demo uses a similar pattern to create a Grammar for start and stop commands:

Choices ch_StartStopCommands = new Choices();
ch_StartStopCommands.Add("speech on");
ch_StartStopCommands.Add("speech off");
ch_StartStopCommands.Add("klatu barada nikto");
GrammarBuilder gb_StartStop = new GrammarBuilder();
gb_StartStop.Append(ch_StartStopCommands);
Grammar g_StartStop = new Grammar(gb_StartStop);

You can define grammars very flexibly. Here the commands "speech on", "speech off" and "klatu barada nikto" are placed in one grammar because they are logically related. The three commands could have been defined in three different grammars, or the "speech on" and "speech off" commands could be put in one grammar and the "klatu barada nikto" command in a second.
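For example, here is a sketch of the alternative layout, with the exit command in its own grammar (a design variation, not the demo's actual code):

// Alternative: two separate grammars instead of one combined grammar.
Choices ch_OnOff = new Choices("speech on", "speech off");
Grammar g_OnOff = new Grammar(new GrammarBuilder(ch_OnOff));
Grammar g_Exit = new Grammar(new GrammarBuilder("klatu barada nikto"));
sre.LoadGrammarAsync(g_OnOff);
sre.LoadGrammarAsync(g_Exit);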

After creating all the Grammar objects, you load them into the speech recognizer, and speech recognition is activated:

sre.LoadGrammarAsync(g_StartStop);
sre.LoadGrammarAsync(g_WhatIsXplusY);
sre.RecognizeAsync(RecognizeMode.Multiple);

The RecognizeMode.Multiple argument is required when you have more than one grammar, which will be the case in all but the simplest programs. The Main method ends as follows:

  while (done == false) { ; }

  Console.WriteLine("\nHit <enter> to close shell\n");
  Console.ReadLine();
}
catch (Exception ex)
{
  Console.WriteLine(ex.Message);
  Console.ReadLine();
}
} // Main

The strange-looking empty while loop keeps the shell of the console application running. The loop ends when the class-level Boolean done is set to true by the speech recognition event handler.

Processing Recognized Speech

The code for handling events related to speech recognition begins like this:

static void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  string txt = e.Result.Text;
  float confidence = e.Result.Confidence;
  Console.WriteLine("\nRecognized: " + txt);
  if (confidence < 0.60) return;
  ...

The recognized text is stored in the Result.Text property of the SpeechRecognizedEventArgs object. Alternatively, you can use the Result.Words collection. The Result.Confidence property holds a value between 0.0 and 1.0 that is a rough estimate of how well the spoken text matches any of the grammars associated with the recognizer. The demo's event handler ignores recognized text with low confidence.
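For example, here is a minimal sketch (not part of the demo) of walking the Result.Words collection instead of using the whole phrase; each entry is a RecognizedWordUnit with its own Text and Confidence:

foreach (RecognizedWordUnit word in e.Result.Words)
  Console.WriteLine(word.Text + " (confidence " + word.Confidence + ")");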

Confidence values are highly dependent on the complexity of your grammars, microphone quality, and other factors. For example, if the demo only needs to recognize the numbers 1 through 4, the confidence values on my computer are typically around 0.75. But if the grammar has to recognize numbers from 1 to 100, the confidence values drop to about 0.25. In short, you usually have to experiment with confidence thresholds to get good speech recognition results. Next, the event handler processes the commands that turn listening on and off:

if (txt.IndexOf ("speech on")\u003e \u003d 0) (Console.WriteLine ("Speech is now ON"); speechOn \u003d true;) if (txt.IndexOf ("speech off")\u003e \u003d 0) (Console .WriteLine ("Speech is now OFF"); speechOn \u003d false;) if (speechOn \u003d\u003d false) return;

While it may not be entirely obvious at first, this logic should make sense if you think about it. Next, the secret exit command is processed:

if (txt.IndexOf ("klatu")\u003e \u003d 0 && txt.IndexOf ("barada")\u003e \u003d 0) (((SpeechRecognitionEngine) sender) .RecognizeAsyncCancel (); done \u003d true; Console.WriteLine ("(Speaking: Farewell) "); ss.Speak (" Farewell ");)

Note that the speech recognition engine can actually recognize meaningless words. If a Grammar object contains words that are not in its built-in dictionary, the Grammar attempts to identify those words as best it can using semantic heuristics, and it is usually quite successful. That's why I used "klatu" instead of the correct "klaatu" (from an old science-fiction movie).

Also note that you do not have to process all of a grammar's recognized text ("klatu barada nikto"); you just need enough information to uniquely identify the grammatical phrase ("klatu" and "barada"). The handler concludes by processing the addition phrase:

If (txt.IndexOf ("What")\u003e \u003d 0 && txt.IndexOf ("plus")\u003e \u003d 0) (string words \u003d txt.Split (""); int num1 \u003d int.Parse (words); int num2 \u003d int.Parse (words); int sum \u003d num1 + num2; Console.WriteLine ("(Speaking:" + words + "plus" + words + "equals" + sum + ")"); ss.SpeakAsync (words + "plus" + words + "equals" + sum);)) // sre_SpeechRecognized) // Program) // ns

Note that the text in Result.Text is case sensitive ("What" versus "what"). Once a phrase is recognized, you can parse it into individual words. In this case, the recognized text has the form "What is x plus y", so "What" lands in words[0], and the two numbers to be added (as strings) land in words[2] and words[4].

Installing Libraries

The explanation of the demo program so far has assumed that all the required speech libraries are installed on your computer. To build and run the demo programs, you need to install four packages: the SDK (which provides the ability to create the demos in Visual Studio), the runtime (which runs the demos after they are built), a recognition language, and a synthesis language (what the program speaks).

To install the SDK, search the Internet for "Speech Platform 11 SDK". This will take you to the correct page in the Microsoft Download Center (fig. 4). By clicking the Download button, you will see the options shown in fig. 5. The SDK comes in 32-bit and 64-bit versions. I highly recommend using the 32-bit version regardless of your system's bitness, because the 64-bit version does not interoperate with some applications.


Figure: 4. Main page of SDK installation in Microsoft Download Center


Figure: 5. Installing the Speech SDK

You don't need anything but the single x86 .msi file (for 32-bit systems). With that file selected, clicking Next launches the installer directly. The speech libraries don't give much feedback about when the installation completes, so don't look for a success message.


Figure: 6. Installing the runtime environment

The next step is installing the runtime (fig. 6). It is extremely important to select the same platform version (11 in the demo) and the same bitness (32 or 64) as the SDK. Again, I strongly recommend the 32-bit version, even if you are on a 64-bit system.

Then you can install the recognition language. The download page is shown in fig. 7. The demo program uses the file MSSpeech_SR_en-us_TELE.msi (English - U.S.). SR stands for speech recognition, and TELE for telephony; this means the recognition language is designed to work with low-quality audio input, such as from a phone or a desktop microphone.


Figure: 7. Setting the recognized language

Finally, you can install the language and voice for speech synthesis. The download page is shown in fig. 8. The demo uses the MSSpeech_TTS_en-us_Helen.msi file. TTS (text-to-speech) is essentially a synonym for speech synthesis. Notice that two English (U.S.) voices are available; there are other English voices, but no other U.S. ones. Creating synthesis language files is a complex task. However, other voices can be purchased and installed from a variety of companies.


Figure: 8. Setting the voice and synthesis language

Curiously, although speech recognition and speech synthesis are really quite different things, both packages are options on the same download page. The Download Center UI lets you check both a recognition language and a synthesis language, but trying to install them at the same time proved disastrous for me, so I recommend installing them one at a time.

Comparing Microsoft.Speech to System.Speech

If you are new to speech recognition and synthesis for Windows applications, you can easily get confused by the documentation, because there are several speech platforms. Specifically, in addition to the Microsoft.Speech.dll library used by the demos in this article, there is the System.Speech.dll library, which is part of the Windows operating system. The two libraries are similar in the sense that their APIs are almost, but not completely, identical. Therefore, if you search for speech processing examples on the Internet and see snippets of code rather than complete programs, it is not at all obvious whether the example refers to System.Speech or Microsoft.Speech.

If you're new to speech processing, use the Microsoft.Speech library rather than System.Speech to add speech support to your .NET application.

While both libraries share a common core codebase and similar APIs, they are definitely different. Some of the key differences are summarized in tab. 1.

Tab. 1. The main differences between Microsoft.Speech and System.Speech

  • Installation: the System.Speech DLL is part of the OS, so it is installed on every Windows system; the Microsoft.Speech DLL (with its associated runtime and languages) must be downloaded and installed separately.
  • Training: recognition with System.Speech usually requires training for a specific user, in which the user reads some text and the system learns that user's pronunciation; recognition with Microsoft.Speech works immediately for any user.
  • Vocabulary: System.Speech can recognize almost any word (this is called free dictation); Microsoft.Speech recognizes only the words and phrases that appear in a Grammar object defined in the program.

Adding Speech Recognition Support to a Windows Forms Application

The process of adding speech recognition and text-to-speech support to a Windows Forms or WPF application is similar to that for a console application. To create the demo program shown in fig. 2, I started Visual Studio, created a new C# Windows Forms application, and renamed it to WinFormSpeech.

After loading the template code into the editor, I added a reference to the Microsoft.Speech.dll file in the Solution Explorer window, just as I did in the console program. At the top of the source code, I removed the unnecessary using statements, leaving only references to the System, Data, Drawing, and Forms namespaces. Then I added two using statements for the Microsoft.Speech.Recognition and System.Globalization namespaces.

The Windows Forms demo does not use speech synthesis, so I don't reference the Microsoft.Speech.Synthesis library. Adding text-to-speech to a Windows Forms application works the same way as in a console application.

In Visual Studio design mode, I dragged TextBox, CheckBox, and ListBox controls onto the form. I double-clicked the CheckBox, and Visual Studio automatically created the skeleton of the CheckedChanged event handler method.

Recall that the demo console program immediately began listening for spoken commands and continued to do so until it was terminated. That approach could be used in a Windows Forms application as well, but instead I decided to let the user turn speech recognition on and off with the CheckBox control.

The source code in the demo program's Form1.cs file, where the partial class is defined, is shown in fig. 9. A speech recognition engine object is declared and instantiated as a member of the Form. In the Form constructor, I hook up the SpeechRecognized event handler and then create and load two Grammar objects:

public Form1()
{
  InitializeComponent();
  sre.SetInputToDefaultAudioDevice();
  sre.SpeechRecognized += sre_SpeechRecognized;
  Grammar g_HelloGoodbye = GetHelloGoodbyeGrammar();
  Grammar g_SetTextBox = GetTextBox1TextGrammar();
  sre.LoadGrammarAsync(g_HelloGoodbye);
  sre.LoadGrammarAsync(g_SetTextBox);
  // sre.RecognizeAsync() is in the CheckBox event handler
}

Figure: 9. Adding Speech Recognition Support to Windows Forms

using System;
using System.Data;
using System.Drawing;
using System.Windows.Forms;
using Microsoft.Speech.Recognition;
using System.Globalization;

namespace WinFormSpeech
{
  public partial class Form1 : Form
  {
    static CultureInfo ci = new CultureInfo("en-us");
    static SpeechRecognitionEngine sre = new SpeechRecognitionEngine(ci);

    public Form1()
    {
      InitializeComponent();
      sre.SetInputToDefaultAudioDevice();
      sre.SpeechRecognized += sre_SpeechRecognized;
      Grammar g_HelloGoodbye = GetHelloGoodbyeGrammar();
      Grammar g_SetTextBox = GetTextBox1TextGrammar();
      sre.LoadGrammarAsync(g_HelloGoodbye);
      sre.LoadGrammarAsync(g_SetTextBox);
      // sre.RecognizeAsync() is in the CheckBox event handler
    }

    static Grammar GetHelloGoodbyeGrammar()
    {
      Choices ch_HelloGoodbye = new Choices();
      ch_HelloGoodbye.Add("hello");
      ch_HelloGoodbye.Add("goodbye");
      GrammarBuilder gb_result = new GrammarBuilder(ch_HelloGoodbye);
      Grammar g_result = new Grammar(gb_result);
      return g_result;
    }

    static Grammar GetTextBox1TextGrammar()
    {
      Choices ch_Colors = new Choices();
      ch_Colors.Add(new string[] { "red", "white", "blue" });
      GrammarBuilder gb_result = new GrammarBuilder();
      gb_result.Append("set text box 1");
      gb_result.Append(ch_Colors);
      Grammar g_result = new Grammar(gb_result);
      return g_result;
    }

    private void checkBox1_CheckedChanged(object sender, EventArgs e)
    {
      if (checkBox1.Checked == true)
        sre.RecognizeAsync(RecognizeMode.Multiple);
      else if (checkBox1.Checked == false) // turned off
        sre.RecognizeAsyncCancel();
    }

    void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
      string txt = e.Result.Text;
      float conf = e.Result.Confidence;
      if (conf < 0.65) return;

      this.Invoke(new MethodInvoker(() =>
      { listBox1.Items.Add("I heard you say: " + txt); })); // WinForm specifics

      if (txt.IndexOf("text") >= 0 && txt.IndexOf("box") >= 0 &&
        txt.IndexOf("1") >= 0)
      {
        string[] words = txt.Split(' ');
        this.Invoke(new MethodInvoker(() =>
        { textBox1.Text = words[4]; })); // WinForm specifics
      }
    }

  } // Form
} // ns

I could have created the two Grammar objects directly, as in the console program, but instead, to make the code a little clearer, I defined two helper methods (GetHelloGoodbyeGrammar and GetTextBox1TextGrammar) that do the job.

static Grammar GetTextBox1TextGrammar()
{
  Choices ch_Colors = new Choices();
  ch_Colors.Add(new string[] { "red", "white", "blue" });
  GrammarBuilder gb_result = new GrammarBuilder();
  gb_result.Append("set text box 1");
  gb_result.Append(ch_Colors);
  Grammar g_result = new Grammar(gb_result);
  return g_result;
}

This helper method recognizes the phrase "set text box 1 red". However, the user is not obliged to pronounce the phrase exactly. For example, he could say, "Please set the text in text box 1 to red", and the speech recognition engine would still recognize the phrase as "set text box 1 red", albeit with a lower confidence value than for an exact match with the Grammar template. In other words, when creating Grammar objects you do not have to account for every variation of a phrase. This dramatically simplifies the use of speech recognition.

The event handler for the CheckBox is defined like this:

private void checkBox1_CheckedChanged(object sender, EventArgs e)
{
  if (checkBox1.Checked == true)
    sre.RecognizeAsync(RecognizeMode.Multiple);
  else if (checkBox1.Checked == false) // turned off
    sre.RecognizeAsyncCancel();
}

The speech recognition engine object (sre) exists for the entire life of the Windows Forms application. The object is activated and deactivated by calls to the RecognizeAsync and RecognizeAsyncCancel methods as the user toggles the CheckBox.

The SpeechRecognized event handler definition starts with:

void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
  string txt = e.Result.Text;
  float conf = e.Result.Confidence;
  if (conf < 0.65) return;
  ...

In addition to the routinely used Result.Text and Result.Confidence properties, the Result object has several other useful but more complex properties that you might want to explore, such as Homophones and ReplacementWordUnits. In addition, the speech recognition engine provides several useful events, such as SpeechHypothesized.
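As a minimal sketch (not part of the demo), you could subscribe to SpeechHypothesized to watch the engine's intermediate guesses while the user is still speaking:

sre.SpeechHypothesized += (object sender, SpeechHypothesizedEventArgs e) =>
{
  // An intermediate guess; its confidence is typically lower than the final result's.
  System.Diagnostics.Debug.WriteLine("(Hypothesis: " + e.Result.Text + ")");
};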

The handler displays the recognized text in the ListBox. Because the SpeechRecognized event fires on a thread other than the one that created the form's controls, the code accesses the ListBox through Invoke:

this.Invoke((Action)(() =>
  listBox1.Items.Add("I heard you say: " + txt)));

In theory, in this situation using the MethodInvoker delegate is slightly more efficient than Action, because MethodInvoker is part of the System.Windows.Forms namespace and is therefore specific to Windows Forms applications; the Action delegate is more general. This example shows that you can manipulate a Windows Forms application entirely through speech recognition, an incredibly powerful and useful capability.
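For comparison, here is the same call written with MethodInvoker, as the full listing in fig. 9 does:

this.Invoke(new MethodInvoker(() =>
  listBox1.Items.Add("I heard you say: " + txt)));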

Conclusion

The information provided in this article should get you started right away if you want to explore speech synthesis and recognition in .NET applications. Mastering the technology itself is easy once you get past the initial bumps of installation and component setup. The real challenge with speech synthesis and recognition is knowing when it is genuinely useful.

With console programs, you can create interesting back-and-forth dialogues in which the user asks a question and the program responds, resulting in an essentially Cortana-like environment. You must use some caution, because when speech comes out of your computer's speakers, it is picked up by the microphone and may be recognized again. I have gotten into some pretty funny situations where I asked a question, the application recognized it and answered, but the spoken response triggered the next recognition event, and I ended up in a comical endless speech loop.
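One simple way to break that loop, sketched here under the assumption that the demo's sre and ss objects are in scope, is to stop listening while the synthesizer speaks:

sre.RecognizeAsyncCancel();                 // stop picking up audio
ss.Speak("Three plus four equals seven");   // Speak blocks until finished
sre.RecognizeAsync(RecognizeMode.Multiple); // listen again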

Another possible use of speech in a console program is recognizing commands like "Launch Notepad" and "Launch Word". In other words, such a console program can be used on your computer to perform actions that would otherwise require a lot of keyboard and mouse manipulation.
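A hedged sketch of what such a command grammar might look like (the phrases and the handler test are illustrative, not from the demo):

Choices ch_Apps = new Choices("notepad", "word");
GrammarBuilder gb_Launch = new GrammarBuilder();
gb_Launch.Append("launch");
gb_Launch.Append(ch_Apps);
sre.LoadGrammarAsync(new Grammar(gb_Launch));

// In the SpeechRecognized handler:
if (txt.IndexOf("launch") >= 0 && txt.IndexOf("notepad") >= 0)
  System.Diagnostics.Process.Start("notepad.exe");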

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has contributed to several Microsoft products, including Internet Explorer and Bing. You can contact him at [email protected].

Thanks to the following Microsoft Research experts for reviewing this article: Rob Gruen, Mark Marron, and Curtis von Veh.

This phone has speech recognition, or voice input, but it works only over the Internet, by connecting to Google services. However, the phone can be taught to recognize speech without the Internet; below we look at how to enable offline recognition of Russian. For this method to work, you must have the Voice Search and Google Search applications installed, although these programs are already present in the factory firmware.

For firmware 2.2

Go to the phone settings and select the "Offline speech recognition" item.

Choose Russian and download it.

For firmware 2.8B

In the new firmware, the " Offline speech recognition" absent.

If you had offline packages installed before the firmware update and did not do a wipe (settings reset) during the update, they should have been preserved. Otherwise, you will have to roll back to firmware 2.2, install the voice packages, and only then update the system to 2.8B.

For Rev.B devices

We install the update through recovery and enjoy offline voice recognition.

2. Download the Russian speech base and copy it to the SD card.

Download Russian_offline.zip

3. Enter recovery by holding Volume+ and Power with the phone turned off.

4. Select Apply update from external storage and choose the downloaded archive.

No program can completely replace the manual work of transcribing recorded speech. However, there are solutions that can significantly speed up and facilitate the translation of speech into text, that is, simplify transcription.

Transcription is the conversion of an audio or video file into text form. On the Internet there are paid tasks in which a performer is paid a certain amount of money for transcribing text.

Speech-to-text translation is useful for:

  • students, to convert recorded audio or video lectures into text;
  • bloggers who run websites and blogs;
  • writers and journalists, for producing books and articles;
  • info-business people who need the text of their webinar, speech, and so on;
  • people who find typing difficult, who can dictate a letter and send it to family or friends;
  • other cases.

Let's describe the most effective tools available as PC programs, mobile applications, and online services.

1 Website speechpad.ru

This is an online service that lets you convert speech to text in the Google Chrome browser. The service works both with a microphone and with ready-made files. Of course, the quality will be much higher if you use an external microphone and dictate yourself, but the service does a pretty good job even with YouTube videos.

Click "Enable recording", answer the question about "Using the microphone" - for this click "Allow".

The long instruction on using the service can be collapsed by clicking button 1 in fig. 3. You can get rid of the ads by completing a simple registration.

Figure: 3. Speechpad service

The finished result is easy to edit: either correct the selected word manually or dictate it again. The results are saved in your personal account and can also be downloaded to your computer.

List of video tutorials on working with speechpad:

You can transcribe videos from YouTube or from your computer, but you will need a mixer; more details:

Audio transcription video

The service works in seven languages. There is one small minus: if you need to transcribe a ready-made audio file, its sound plays through the speakers, which creates additional noise in the form of an echo.

2 Service dictation.io

A wonderful online service that will allow you to translate speech to text for free and easily.

Figure: 4. Service dictation.io

1 in fig. 4 - the Russian language can be selected at the bottom of the page. In the Google Chrome browser the language can be selected, but in Mozilla for some reason there is no such option.

It is noteworthy that autosaving of the finished result is implemented, which protects against accidental deletion when a tab or the browser is closed. This service does not recognize ready-made files; it works with a microphone. You need to name the punctuation marks while dictating.

The text is recognized quite correctly, there are no spelling errors. You can insert punctuation marks yourself from the keyboard. The finished result can be saved on your computer.

3 RealSpeaker

This program makes it easy to convert human speech to text. It is designed to work on different systems: Windows, Android, Linux, Mac. With it, you can convert speech spoken into a microphone (for example, a laptop's built-in one) as well as speech recorded in audio files.

It can recognize 13 languages. There is a beta version of the program that works as an online service:

You need to follow the link above, select Russian, upload your audio or video file to the online service, and pay for its transcription. After transcription, you can copy the resulting text. The larger the file, the more time it takes to process; more details:

In 2017 there was a free transcription option using RealSpeaker; in 2018 there is no such option. It is rather troubling that the transcribed file is available to all users for download; perhaps this will be improved.

The developer's contacts (VKontakte, Facebook, YouTube, Twitter, e-mail, phone) can be found on the program's website (more precisely, in the site footer):

4 Speechlogger

An alternative to the previous app for mobile devices running Android. It is available for free in the app store:

The text is edited automatically, punctuation marks are placed in it. Very handy for dictating notes or making lists. As a result, the text will be of a very decent quality.

5 Dragon Dictation

This application is distributed free of charge for mobile devices from Apple.

The program can work with 15 languages. It allows you to edit the result and select the desired words from a list. You need to pronounce all sounds clearly, avoid unnecessary pauses, and avoid intonation. Sometimes there are errors in the endings of words.

Owners use the Dragon Dictation application, for example, to dictate a shopping list while moving around the apartment. Once in the store, they can look at the text in the note instead of listening to anything.

Whatever program you use in your practice, be prepared to double-check the result and make certain adjustments. This is the only way to get a flawless text without errors.

Also useful services:


Updated: Monday, July 31, 2017

What does the semi-fantastic idea of talking to a computer have to do with professional photography? Almost nothing, unless you are a fan of the idea of the endless development of a person's entire technical environment. Imagine for a moment giving voice commands to your camera to change the focal length and apply exposure compensation of plus half a stop. Remote control of the camera has already been implemented, but there you have to silently press buttons, whereas here the camera itself would hear you!

It has become a tradition to cite a science-fiction film as an example of a person's voice communication with a computer, for instance "2001: A Space Odyssey" directed by Stanley Kubrick. There the on-board computer not only conducts a meaningful dialogue with the astronauts but can read lips like a deaf person. In other words, the machine has learned to recognize human speech without errors. Perhaps remote voice control of a camera will seem superfluous to some, but many would like to say "Take our picture, baby" and have the shot of the whole family against a palm tree ready.

Well, here I have paid tribute to tradition and fantasized a little. But, frankly speaking, this article was difficult to write, and it all started with a gift in the form of a smartphone running Android 4. This HUAWEI U8815 model has a small four-inch touchscreen and an on-screen keyboard. Typing on it is somewhat unusual, but, as it turned out, not particularly necessary. (image01)

1. Voice recognition in a smartphone running on Android OS

While mastering the new toy, I noticed the graphic image of a microphone in the Google search bar and on the keyboard in Notes. Previously I had not been interested in what this symbol stood for. I had conversations on Skype and typed letters on the keyboard, as most Internet users do. But, as was explained to me later, voice search in Russian had been added to the Google search engine, and programs had appeared that let you dictate short messages in the Chrome browser.

I uttered a phrase of three words; the program identified them and showed them in a cell with a blue background. There was something to be surprised at, because all the words were spelled correctly. If you tap this cell, the phrase appears in the text field of the Android notepad. So I said a couple more phrases and sent an SMS message to my assistant.


2. A brief history of voice recognition programs

It was no discovery to me that modern advances in voice control make it possible to give commands to household appliances, a car, or a robot. A command mode was present in past versions of Windows, OS/2 and Mac OS. I had seen talking programs, but what was the use? Perhaps it is my peculiarity that it is easier for me to speak than to type on a keyboard, and on a cell phone I cannot type anything at all; I have to enter contacts on a laptop with a normal keyboard and transfer them over a USB cable. But to simply speak into a microphone and have the computer itself type the text without errors: that was a dream for me. The atmosphere of hopelessness was fueled by forum discussions, where the same sad thought appeared everywhere:

"However, in fact, until now programs for real speech recognition (especially in Russian) practically do not exist, and they will obviously not be created soon. Moreover, even the task opposite to recognition, speech synthesis, which would seem much simpler than recognition, has not been fully solved." (ComputerPress #12, 2004)

"There are still no normal speech recognition programs (not only for Russian), since the task is quite difficult for a computer. And the worst thing is that the human word recognition mechanism has not been uncovered, so there is nothing to build on when creating recognition programs." (Another discussion on a forum)

That said, reviews of English-language voice-input programs indicated clear success. For instance, IBM ViaVoice 98 Executive Edition had a basic vocabulary of 64,000 words and the ability to add the same number of the user's own words. The percentage of words recognized without training the program was about 80%, and with continued work with a specific user it reached 95%.

Among Russian-language recognition programs, "Gorynych", an add-on to the English Dragon Dictate 2.5, is worth noting. I will tell about the search for it, and then the "battle with five Gorynyches", in the second part of the review. The first one I found was the "English dragon".

3. The continuous speech recognition program "Dragon Naturally Speaking"

The modern version of the Nuance program ended up with an old friend of mine from the Minsk Institute of Foreign Languages. She had brought it from a trip abroad, having bought it in the hope that it could be a "computer secretary". But something did not work out, and the program lay almost forgotten on her laptop. Since I lacked any intelligible experience of my own, I had to go visit my friend. All this lengthy introduction is necessary for a correct understanding of the conclusions I drew.

The full name of my first dragon sounded like this: ... The program is in English, and everything in it is clear even without a manual. The first step is to create a profile for a specific user, to capture the peculiarities of how words sound as that user pronounces them. What matters is the speaker's age, country, and pronunciation features; my choices were age 22-54, English UK, standard pronunciation. Next come several windows where you set up your microphone. (image04)

The next stage for serious speech recognition programs is training on the pronunciation of a particular person. You are asked to choose the nature of the text: my choice was a short instruction on dictation, but you can also "order" a humorous story.

The essence of this stage is extremely simple: the text is displayed in a window with a yellow arrow above it. When the words are pronounced correctly, the arrow moves through the phrases, and at the bottom there is a progress bar for the training. I had pretty much forgotten my conversational English, so I advanced with difficulty. Time was also limited: the computer was not mine, and I had to interrupt the training. But my friend said she passed the test in less than half an hour. (image05)

Having declined to adapt the program to my pronunciation, I went to the main window and launched the built-in text editor. I spoke individual words from texts I found on the computer. Words I pronounced correctly the program printed; words I pronounced poorly it replaced with something "English". Having clearly pronounced the command "erase a line" in English, I watched the program execute it. This means I was reading the commands correctly, and the program recognizes them without preliminary training.

But what mattered to me was how this "dragon" writes in Russian. As you understood from the previous description, during training you can select only English text; there simply is no Russian. Clearly, training it to recognize Russian speech is not going to work. In the next photo you can see what phrase the program typed when the Russian word for "Hello" was pronounced. (image06)

The result of communicating with the first dragon turned out slightly comical. If you read the text on the official website carefully, you can see the English "specialization" of this software product. Besides, on loading we read "English" in the program window. So why was all this necessary? Clearly, forums and rumors are to blame...

But there was also a useful experience. My friend asked me to check the condition of her laptop, which had somehow become slow. This is not surprising: the system partition had only 5% free space. While removing unnecessary programs, I saw that the official version occupied over 2.3 GB. This figure will be useful to us later. (image07)



4. The Russian speech recognition program "Gorynych"

Recognition of Russian speech, as it turned out, was not a trivial task. In Minsk I managed to find a friend who had "Gorynych". He spent a long time searching for the disc in his old stockpiles and, according to him, it is the official release. The program installed instantly, and I learned that its dictionary contains 5,000 Russian words plus 100 commands, and 600 English words plus 31 commands.

First you need to set up the microphone, which I did. Then I opened the dictionary and added the word "Check", because it was not in the program's dictionary. I tried to speak clearly and monotonously. Finally, I opened the program "Gorynych Pro 3.0", turned on dictation mode, and got this list of "similar-sounding words". (image09)

The result puzzled me, because it was clearly worse than the work of the Android smartphone, so I decided to try other programs from the Google Chrome online store and to put off dealing with the "mountain serpents" until later. Such procrastination seemed to me an action in the original Russian spirit.

5. Google's voice capabilities

To work with voice on a regular Windows computer, you need to install the Google Chrome browser. If you use it on the Internet, you can click the link to the software store at the bottom right. There I found, free of charge, two programs and two extensions for voice text input. The programs are called "Voice Notebook" and "Voisnot - Voice to Text". After installation, they can be found on the "Applications" tab of your Chrome browser. (image10)

The extensions are called "Google Voice Search Hotword (Beta) 0.1.0.5" and "Voice text input - Speechpad.ru 5.4". After installation, they can be disabled or removed on the "Extensions" tab. (image11)

VoiceNote. On the applications tab in the Chrome browser, double-click the program icon. A dialog box opens like the one in the picture below. Click the microphone icon and speak short phrases into the microphone. The program sends your words to the speech recognition server and types the text in the window. All the words and phrases shown in the illustration were typed on the first try. Obviously, this method works only with an active Internet connection. (image12)

Voice notepad. If you run the program on the applications tab, a new browser tab opens with the page Speechpad.ru. There you will find detailed instructions on using the service and a compact form, shown in the illustration below. (image13)

Voice text input lets you fill in text fields on web pages by voice. For example, I went to my Google+ page. In the new-message input field, I right-clicked and selected "SpeechPad". The pink-colored input box tells you to dictate your text. (image14)

Google voice search allows you to search by voice. When you install and activate this extension, a microphone symbol appears in the search bar. When you click it, a symbol in a large red circle appears. Just say your search phrase, and it will appear in the search results. (image15)

Important note: for the microphone to work with Chrome extensions, you must allow microphone access in the browser settings; it is disabled by default for security reasons. Go to Settings → Personal data → Content settings (to reach all the settings, click Show advanced settings at the end of the list). In the Page content settings dialog box that opens, scroll down to Multimedia → Microphone.

6. Results of working with Russian speech recognition programs

My brief experience with voice input programs has shown an excellent implementation of this feature on the servers of the Internet company Google. Without any prior training, words are recognized correctly. This indicates that the problem of Russian speech recognition has been solved.

Now we can say that Google's result will become a new criterion for evaluating products from other manufacturers. I would like the recognition system to work offline, without contacting the company's servers; that would be more convenient and faster. But it is not known when an independent program for working with a continuous stream of Russian speech will be released. It should be assumed, though, that given the possibility of training, such a "creation" will be a real breakthrough.

I will cover the programs of the Russian developers, "Gorynych", "Dictographer" and "Combat", in detail in the second part of this review. This article was written very slowly because finding the original discs is now difficult. At the moment I have all versions of the Russian voice-to-text "recognizers" except "Combat 2.52". None of my friends or colleagues have this program, and I myself have found only a few laudatory comments on forums. True, there was a strange option of downloading "Combat" via SMS, but I don't like it. (image16)


A short video clip shows how speech recognition works in a smartphone with Android OS. A feature of voice input is the need to connect to Google servers: the Internet must be working for you.
