(4.4) OCR | Text Analytics | Speech | OpenAI GPT

Caio Gasparine
8 min read · Nov 8, 2023


Optical character recognition, sentiment analysis, and speech recognition using Azure AI Services

This is part of a series of articles called Azure Challenges. You can refer to the Intro Page to understand more about how the challenges work.

As usual…

Before we start, there are some important clarification points:

(1) Troubleshooting is IMPORTANT — It is important for you to practice searching for error messages, finding solutions, and tracking down bugs in your code, environment, IDE, etc.

(2) The code IS JUST code — There are several ways to write it, in several different languages. The examples here are just one way to do it.

(3) This IS NOT a prep course — The main goal here is to show the practical application of Azure Resources with a focus on Enterprise AI solutions.

(4) You won’t be graded on the challenges, but they are an important practical component of your learning experience.

(5) Make sure you are using a FREE student account and check your costs!

(6) You can expect a different result OR an error if Microsoft changes the API configuration or inputs. Please be aware of this and always check the API documentation for more info.

Let's get started!

TIP OF THE DAY!!!

OCR — JavaScript

The code, in HTML:

<!DOCTYPE html>
<html>
<head>
    <title>OCR Sample</title>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.0/jquery.min.js"></script>
</head>
<body>

<script type="text/javascript">
    function processImage() {
        // **********************************************
        // *** Update or verify the following values. ***
        // **********************************************

        var subscriptionKey = document.getElementById("subscriptionKey").value;
        var endpoint = document.getElementById("endpointUrl").value;

        var uriBase = endpoint + "vision/v3.1/ocr";

        // Request parameters.
        var params = {
            "language": "unk",
            "detectOrientation": "true",
        };

        // Display the image.
        var sourceImageUrl = document.getElementById("inputImage").value;
        document.querySelector("#sourceImage").src = sourceImageUrl;

        // Perform the REST API call.
        $.ajax({
            url: uriBase + "?" + $.param(params),

            // Request headers.
            beforeSend: function(jqXHR) {
                jqXHR.setRequestHeader("Content-Type", "application/json");
                jqXHR.setRequestHeader("Ocp-Apim-Subscription-Key", subscriptionKey);
            },

            type: "POST",

            // Request body.
            data: '{"url": ' + '"' + sourceImageUrl + '"}',
        })

        .done(function(data) {
            // Show formatted JSON on webpage.
            $("#responseTextArea").val(JSON.stringify(data, null, 2));
        })

        .fail(function(jqXHR, textStatus, errorThrown) {
            // Display error message.
            var errorString = (errorThrown === "") ?
                "Error. " : errorThrown + " (" + jqXHR.status + "): ";
            errorString += (jqXHR.responseText === "") ? "" :
                (jQuery.parseJSON(jqXHR.responseText).message) ?
                    jQuery.parseJSON(jqXHR.responseText).message :
                    jQuery.parseJSON(jqXHR.responseText).error.message;
            alert(errorString);
        });
    }
</script>

<h1>Optical Character Recognition (OCR):</h1>
Enter the URL to an image of printed text, then
click the <strong>Read image</strong> button.
<br><br>
Subscription key:
<input type="text" name="subscriptionKey" id="subscriptionKey" value="" />
Endpoint URL:
<input type="text" name="endpointUrl" id="endpointUrl" value="" />
<br><br>
Image to read:
<input type="text" name="inputImage" id="inputImage"
       value="https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/Atomist_quote_from_Democritus.png/338px-Atomist_quote_from_Democritus.png" />
<button onclick="processImage()">Read image</button>
<br><br>
<div id="wrapper" style="width:1020px; display:table;">
    <div id="jsonOutput" style="width:600px; display:table-cell;">
        Response:
        <br><br>
        <textarea id="responseTextArea" class="UIInput"
                  style="width:580px; height:400px;"></textarea>
    </div>
    <div id="imageDiv" style="width:420px; display:table-cell;">
        Source image:
        <br><br>
        <img id="sourceImage" width="400" />
    </div>
</div>
</body>
</html>

You can use/copy the code above OR open the file OCR.html, which is part of our class download from GitHub.

Check for your service keys and endpoint.

Add both values to your file…

Now you have the result embedded in the HTML page, and you can do the same with any language. In the screenshot below you can see the API response and the source image that was used by the API.
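
For example, a minimal Python sketch of the same REST call (using the requests library; the subscription key and endpoint values below are placeholders you must replace with your own) could look like this:

import requests

# Placeholder values - replace with the key and endpoint of your Computer Vision resource.
subscription_key = "<your-subscription-key>"
endpoint = "https://<your-resource>.cognitiveservices.azure.com/"

ocr_url = endpoint + "vision/v3.1/ocr"
headers = {
    "Ocp-Apim-Subscription-Key": subscription_key,
    "Content-Type": "application/json",
}
params = {"language": "unk", "detectOrientation": "true"}
image_url = ("https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/"
             "Atomist_quote_from_Democritus.png/338px-Atomist_quote_from_Democritus.png")

# Call the OCR endpoint and print the JSON response.
response = requests.post(ocr_url, headers=headers, params=params, json={"url": image_url})
response.raise_for_status()
print(response.json())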

Other examples — URL used : https://jeroen.github.io/images/testocr.png

Other examples — URL used: https://courses.cs.vt.edu/csonline/AI/Lessons/VisualProcessing/OCRscans_files/robertson.jpg

Other examples — URL used: https://www.pyimagesearch.com/wp-content/uploads/2020/08/ocr_handwriting_reco_adrian_sample.jpg

Text Analytics

What is Text Analytics?

Mine insights in unstructured text using natural language processing (NLP) — no machine learning expertise required. Gain a deeper understanding of customer opinions with sentiment analysis. Identify key phrases and entities such as people, places, and organizations to understand common topics and trends. Classify medical terminology using domain-specific, pre-trained models. Evaluate text in a wide range of languages.

The Text Analytics API is a cloud-based service that provides NLP features for text mining and text analysis, including sentiment analysis, opinion mining, key phrase extraction, language detection, and named entity recognition.

https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/

Text Analytics

  • Detect Language
  • Extract Key Phrases
  • Determine Sentiment
  • Extract Known Entities

Open the file AIDI1006-text-analytics

Replace the Key and Endpoint values.
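
The notebook uses the Azure Text Analytics SDK. As a minimal setup sketch (assuming the azure-ai-textanalytics package is installed; the key and endpoint values are placeholders), the client used in the snippets below can be created like this:

import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Placeholder values - replace with the key and endpoint of your Language resource.
key = os.environ.get("AZURE_LANGUAGE_KEY", "<your-key>")
endpoint = os.environ.get("AZURE_LANGUAGE_ENDPOINT", "https://<your-resource>.cognitiveservices.azure.com/")

# Client reused by the Detect Language, Key Phrases, Sentiment, and Entities snippets below.
client = TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key))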

Detect Language
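
A minimal sketch of language detection, reusing the client created above (the sample sentences are placeholders):

# Detect the language of each document.
documents = ["Ce document est rédigé en Français.", "This one is written in English."]
result = client.detect_language(documents)
for doc in result:
    if not doc.is_error:
        print(doc.primary_language.name, doc.primary_language.iso6391_name)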

Detect Language — Result

Extract Key Phrases
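
A minimal key phrase extraction sketch with the same client (the sample text is a placeholder):

# Extract the key phrases from each document.
documents = ["The food was delicious and the staff were wonderful at the hotel in Toronto."]
result = client.extract_key_phrases(documents)
for doc in result:
    if not doc.is_error:
        print(doc.key_phrases)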

Extract Key Phrases — Result

Determine Sentiment
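
A minimal sentiment analysis sketch with the same client (the sample text is a placeholder):

# Score the overall sentiment of each document.
documents = ["The rooms were beautiful, but the service was terribly slow."]
result = client.analyze_sentiment(documents)
for doc in result:
    if not doc.is_error:
        print(doc.sentiment, doc.confidence_scores.positive, doc.confidence_scores.negative)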

Determine Sentiment — Result

Extract Known Entities
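
A minimal named entity recognition sketch with the same client (the sample text is a placeholder):

# Recognize known entities (people, places, organizations, ...) in each document.
documents = ["Microsoft was founded by Bill Gates and Paul Allen in Albuquerque."]
result = client.recognize_entities(documents)
for doc in result:
    if not doc.is_error:
        for entity in doc.entities:
            print(entity.text, entity.category, entity.confidence_score)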

Extract Known Entities — Result

Speech (Cognitive Services)

Open the file AIDI1006-speech

Next step: take a look at the code.
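
The notebook cells are not reproduced here, but as a minimal sketch of what a basic speech-to-text call looks like (assuming the azure-cognitiveservices-speech package, placeholder key/region values, and a hypothetical local WAV file name):

import azure.cognitiveservices.speech as speechsdk

# Placeholder values - replace with the key and region of your Speech resource.
speech_config = speechsdk.SpeechConfig(subscription="<your-key>", region="<your-region>")
audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")  # hypothetical file name

# Recognize a single utterance from the audio file.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized:", result.text)
else:
    print("No speech recognized:", result.reason)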


Speech from Microphone (Jupyter Notebook)

Attention: Make sure you are running this code locally.

Now, let’s use two additional files as examples of how you can use the Speech API and embed this solution in your code.

Please open the file AI1006-speech-from-mic.ipynb

You can use this code to transcribe the sound from your microphone to text.
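
A minimal sketch of that microphone transcription (assuming the azure-cognitiveservices-speech package and placeholder key/region values; with no audio configuration the SDK listens on the default microphone):

import azure.cognitiveservices.speech as speechsdk

# Placeholder values - replace with the key and region of your Speech resource.
speech_config = speechsdk.SpeechConfig(subscription="<your-key>", region="<your-region>")

# With no AudioConfig, the recognizer uses the default microphone.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Say something...")
result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("You said:", result.text)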

Text to Speech OR Speech Synthesizer (Jupyter Notebook)

Attention: Make sure you are running this code locally.

Next example. Please open the file AI1006-text-to-speech.ipynb
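
A minimal text-to-speech sketch with the same SDK (placeholder key/region values; with no output configuration the audio plays on the default speaker):

import azure.cognitiveservices.speech as speechsdk

# Placeholder values - replace with the key and region of your Speech resource.
speech_config = speechsdk.SpeechConfig(subscription="<your-key>", region="<your-region>")

# With no audio output configuration, the synthesized audio plays on the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello from Azure AI Speech!").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized.")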

OpenAI GPT

Please open the file OpenAI-GPT.ipynb…

Attention:

(1) Store your OpenAI key as an environment variable (always).

(2) Use model="gpt-4o-mini" OR model="gpt-3.5-turbo".

(3) Response = ChatCompletion

OpenAI’s Chat Completion format is part of the API that allows users to interact with language models like GPT (e.g., GPT-3, GPT-4) by sending a series of messages, where the model responds with a completion. This process is often used for tasks like dialogue generation, conversation, question-answering, and other interactive tasks.

The general format of a Chat Completion API request is as follows:

{
    "model": "gpt-4",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Can you explain the format for chat completion?"
        }
    ],
    "temperature": 0.7,
    "max_tokens": 150,
    "n": 1,
    "stop": ["\n"]
}

Key Fields:

  • model: Specifies the language model to use, such as gpt-4, gpt-3.5, etc.
  • messages: A list of message objects that make up the conversation. Each message contains:
      • role: Defines the role of the message sender (system, user, or assistant).
      • content: The content of the message (the text being sent).
  • temperature: Controls the randomness of the response (0 = deterministic, 1 = more random). A value between 0 and 1.
  • max_tokens: The maximum number of tokens (words or pieces of words) the model can generate.
  • n: The number of completions to generate for each prompt.
  • stop: Optionally, you can specify a stop sequence to end the model’s output.

Response Format

A typical response from the API would look like:

{
    "id": "chatcmpl-6f0lhkdx1XUqX1lsfYcR2kC0ZR4h3",
    "object": "chat.completion",
    "created": 1627665400,
    "model": "gpt-4",
    "usage": {
        "prompt_tokens": 23,
        "completion_tokens": 50,
        "total_tokens": 73
    },
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "The Chat Completion format is used for interactive dialogues. It includes a series of messages exchanged between a system, user, and assistant."
            },
            "finish_reason": "stop",
            "index": 0
        }
    ]
}

Key Response Fields:

  • id: A unique identifier for the response.
  • object: The type of the response (usually “chat.completion”).
  • created: A timestamp of when the response was generated.
  • model: The model used for the completion.
  • usage: Information about token usage (e.g., how many tokens were consumed in the request and response).
  • choices: A list of generated completions. Typically you get one choice, but if n was set to more than 1, this list would contain multiple responses. Each choice contains:
      • message: The message sent by the assistant.
          • role: Always “assistant” for the generated response.
          • content: The model’s response text.
      • finish_reason: Indicates how the response ended (e.g., “stop” or “length”).

This structure is how OpenAI’s GPT models handle chat interactions. You can adjust parameters like temperature or max_tokens to control the behavior of the model's responses.
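
In Python, the same kind of request can be sent with the OpenAI client library. A minimal sketch, assuming the openai package (v1 or later) and the API key stored in the OPENAI_API_KEY environment variable, as recommended above:

import os
from openai import OpenAI

# The key is read from the OPENAI_API_KEY environment variable - never hard-code it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",  # or "gpt-3.5-turbo"
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Can you explain the format for chat completion?"},
    ],
    temperature=0.7,
    max_tokens=150,
)

# The assistant's reply is in the first choice.
print(response.choices[0].message.content)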

Other references:

https://platform.openai.com/docs/overview

Please do not forget to check your subscription costs and available budget in your Azure subscription!

The architecture so far…

Final message…

This is the end of the Challenge #4 series.

You can go to the Intro Page and start the next Challenge.
