E-Book Sentiment using Azure Cognitive Services

Have you ever wondered whether a book you are about to read is pretty positive or a bit of a depressing downer? Are you, like me, too lazy to actually read it to find out? Well, look no further. I have made a very simple e-book sentiment analyzer using Azure Cognitive Services. It gives the overall sentiment of an e-book and also the sentiment of each chapter, in case you want to jump straight to the good parts.

Prerequisites

To follow this tutorial, you will need:

  1. The source code used in this demo, which you can find at this GitHub repository.
  2. Visual Studio and .NET installed to be able to compile the code.
  3. A Cognitive Services API key.
  4. One or more e-books in EPUB format.

You can obtain a Cognitive Services API key by:

  1. Getting an Azure Subscription. You can get a free trial subscription if you don't have one already.
  2. Creating a Cognitive Services Text Analytics resource and finding its key.

You can find more details on that here. Alternatively, you can get a free trial API key.

The EPUB Analyzer

There are many different e-book formats out there. In this demo, I have chosen to use the EPUB format, which is supported by many browsers and e-readers. You can find books in this format (many of them free, public domain) online. The books I have used were found on https://www.feedbooks.com.

I have written my e-book analyzer in C# (.NET), but the Cognitive Services API is really a simple REST interface that you should be able to call from pretty much any application or language. In order to parse the EPUB files, I have used the VersOne.Epub library. The code for that library is on GitHub.

You can find the complete source code for the e-book sentiment analyzer on GitHub. It is a pretty simple application with a single source file. In fact, it is so short and simple that I will just reproduce the source code here:

using System;
using System.Text;
using HtmlAgilityPack;
using VersOne.Epub;
using Microsoft.ProjectOxford.Text.Core.Exceptions;
using Microsoft.ProjectOxford.Text.Sentiment;

namespace EpubSentiment
{
    class Program
    {

        static void AppendChapter(ref SentimentRequest request, EpubChapter chapter)
        {
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(chapter.HtmlContent);

            StringBuilder sb = new StringBuilder();

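            // SelectNodes returns null, not an empty collection, when nothing matches.
            HtmlNodeCollection textNodes = htmlDocument.DocumentNode.SelectNodes("//text()");
            if (textNodes == null)
            {
                return;
            }

            foreach (HtmlNode node in textNodes)
            {
                sb.AppendLine(node.InnerText.Trim());
            }

            string chapterText = sb.ToString();
            if (string.IsNullOrWhiteSpace(chapterText))
            {
                return; // Nothing to analyze; also avoids a 0/0 division below.
            }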

            int maxCharacters = 3 * 1024; //Max characters that we will send to sentiment API
            int chunks = (int)Math.Ceiling((double)chapterText.Length / (double)maxCharacters);
            int charsPerChunk = (int)Math.Ceiling((double)chapterText.Length / (double)chunks);

            int offset = 0;

            for (int i = 0; i < chunks; ++i)
            {
                if (offset + charsPerChunk > chapterText.Length)
                {
                    charsPerChunk = chapterText.Length - offset;
                }

                var testText = chapterText.Substring(offset, charsPerChunk);
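                // Document IDs must be unique within a request. AppendChapter runs once
                // per chapter and subchapter on the same request, so number the chunks
                // by their position in the request rather than by i.
                string chunkID = "CHUNKDOCUMENT" + request.Documents.Count;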
                var doc = new SentimentDocument() { Id = chunkID, Text = testText, Language = "en" };

                request.Documents.Add(doc);

                offset += charsPerChunk;
            }


        }

        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: ");
                Console.WriteLine("  " + System.AppDomain.CurrentDomain.FriendlyName + " <FILENAME> <APIKEY>");
                Environment.Exit(1);
            }
            string bookfile = args[0];
            string apiKey = args[1];

            Console.WriteLine("Analyzing book: " + bookfile);
            EpubBook epubBook = EpubReader.ReadBook(bookfile);

            string title = epubBook.Title;
            string author = epubBook.Author;

            Console.WriteLine("Book title: " + title);
            Console.WriteLine();

            double bookScore = 0.0;
            int numChapters = 0;
            foreach (EpubChapter chapter in epubBook.Chapters)
            {

                var request = new SentimentRequest();

                string chapterTitle = chapter.Title;

                AppendChapter(ref request, chapter);

                foreach (EpubChapter subChapter in chapter.SubChapters)
                {
                    AppendChapter(ref request, subChapter);
                }
           
                var client = new SentimentClient(apiKey);
                var response = client.GetSentiment(request);

                foreach (Microsoft.ProjectOxford.Text.Core.DocumentError e in response.Errors)
                {
                    Console.WriteLine("Errors: " + e.Message);
                }

                double score = 0.0;
                int numScores = 0;

                foreach (SentimentDocumentResult r in  response.Documents)
                {
                    score += r.Score;
                    numScores++;
                }

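                // Skip chapters that produced no scored chunks to avoid a NaN from 0/0.
                if (numScores == 0)
                {
                    continue;
                }

                score /= numScores;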

                Console.WriteLine(numChapters + ": " + chapterTitle + ", score: " + score);

                bookScore += score;
                numChapters++;
            }

            // Guard against a book with no scored chapters.
            if (numChapters > 0)
            {
                bookScore /= numChapters;
            }

            Console.WriteLine();
            Console.WriteLine("Average book sentiment: " + bookScore);
        }
    }
}

This code does a few different things:

  1. Takes two input arguments: an EPUB file and the API key needed to call the Cognitive Services API.
  2. Opens the EPUB file and reads some basic metadata such as the title.
  3. Loops through all chapters (and subchapters) of the book to extract the chapter text.
  4. Packages the text from each chapter into manageable chunks. The Cognitive Services API has a limit on how much text you can send at a time, so we chop a chapter into some smaller chunks.
  5. Queries the Cognitive Services API to get the sentiment for each chunk and then calculates an average sentiment for a given chapter.
  6. Calculates the average sentiment for the entire book.
  7. Prints some output.

The code should be self-explanatory, but a few comments may be in order.

Firstly, chopping a chapter into chunks and taking the chapter sentiment to be the average of the chunk sentiments is probably not mathematically or statistically all that sound. Specifically, the sentiment score is probably not linear in the text, so the result can vary depending on how the chapter is chopped. Moreover, this problem is not likely to "average out" over many chapters or books. It is beyond the scope of this little tutorial to go into details on this, but one could use this tool to investigate further by varying the chunk size.

I have also somewhat arbitrarily chosen 3K (3 * 1024) characters as the maximum chunk size. This choice was not based on any rigorous analysis. I wanted a size small enough to fit within the limits of the Cognitive Services API (10KB of data) while being large enough that I don't make too many calls to the API (and thus incur large costs). It is easy to play with these settings in the application.
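If you want to experiment with the chunk size, it may help to pull the chunking arithmetic out into a small helper. The following is only a sketch: the class and method names are made up for illustration, and the length-weighted average is one possible alternative to the plain average used in the program, not something the analyzer above does.

```csharp
using System;
using System.Collections.Generic;

static class ChunkingSketch
{
    // Splits text into near-equal chunks of at most maxCharacters each,
    // using the same arithmetic as AppendChapter.
    public static List<string> Split(string text, int maxCharacters)
    {
        var chunks = new List<string>();
        if (string.IsNullOrEmpty(text) || maxCharacters <= 0)
        {
            return chunks;
        }

        int count = (int)Math.Ceiling((double)text.Length / maxCharacters);
        int size = (int)Math.Ceiling((double)text.Length / count);

        for (int offset = 0; offset < text.Length; offset += size)
        {
            int length = Math.Min(size, text.Length - offset);
            chunks.Add(text.Substring(offset, length));
        }

        return chunks;
    }

    // Hypothetical alternative to the plain mean: weight each chunk's score
    // by the chunk's length so that a short trailing chunk counts for less.
    public static double WeightedAverage(IList<string> chunks, IList<double> scores)
    {
        if (chunks.Count != scores.Count || chunks.Count == 0)
        {
            throw new ArgumentException("Need one score per chunk and at least one chunk.");
        }

        double weightedSum = 0.0;
        double totalLength = 0.0;
        for (int i = 0; i < chunks.Count; ++i)
        {
            weightedSum += scores[i] * chunks[i].Length;
            totalLength += chunks[i].Length;
        }

        if (totalLength == 0.0)
        {
            throw new ArgumentException("All chunks are empty.");
        }

        return weightedSum / totalLength;
    }
}
```

Calling ChunkingSketch.Split(chapterText, 1024) and then ChunkingSketch.Split(chapterText, 3 * 1024) on the same chapter, and scoring both sets of chunks, would give a rough idea of how sensitive the chapter score is to the chunk size.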

I am using the .NET API for Cognitive Services in this example. A different way to do this is through the REST API, which would be more generic and probably make it easier for people to port this code to other languages, but the .NET API provided an easy way to make this a very compact code example.
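For comparison, here is roughly what a direct REST call could look like from C#. Treat it as a sketch under assumptions: the endpoint below assumes a resource in the westus region and version 2.0 of the Text Analytics API, the class and method names are invented for this example, and it relies on System.Text.Json, which the program above does not use. Check the endpoint shown for your own resource in the Azure portal before trying it.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

static class RestSentimentSketch
{
    // Assumption: region and API version depend on your own resource.
    const string Endpoint =
        "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment";

    // Builds the JSON request body for a single English document.
    public static string BuildBody(string id, string text)
    {
        var payload = new
        {
            documents = new[] { new { id, language = "en", text } }
        };
        return JsonSerializer.Serialize(payload);
    }

    // Posts one document and returns the raw JSON response as a string.
    public static async Task<string> GetSentimentJsonAsync(
        HttpClient http, string apiKey, string text)
    {
        using var request = new HttpRequestMessage(HttpMethod.Post, Endpoint);
        request.Headers.Add("Ocp-Apim-Subscription-Key", apiKey);
        request.Content = new StringContent(
            BuildBody("1", text), Encoding.UTF8, "application/json");

        using HttpResponseMessage response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

The same two ingredients, a JSON body with a list of documents and the subscription key in a request header, are all that a port to another language would need.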

I make no attempt to deal with books in languages other than English. The Cognitive Services API could actually be used to detect the language and get the sentiment for the appropriate language, or to translate the text before calling the sentiment API. Again, that would have made for a more elaborate example, and in the interest of brevity, this example only works for books in English.

An example analysis

So now that we have a sentiment analyzer, let's take it for a spin. I have chosen "A Christmas Carol" by Charles Dickens. It is available in public domain form. Running the analyzer on it would look something like this:

PS> dotnet.exe .\EpubSentiment.dll C:\temp\christmas_carol.epub <API KEY>

Analyzing book: C:\temp\christmas_carol.epub
Book title: A Christmas Carol

0: Title, score: 0.5
1: About, score: 0.999999523162842
2: Chapter 1 - Marley's Ghost, score: 0.228464378760411
3: Chapter 2 - The First Of The Three Spirits, score: 0.561749743918578
4: Chapter 3 - The Second Of The Three Spirits, score: 0.87196253426373
5: Chapter 4 - The Last Of The Spirits, score: 0.384314155578613
6: Chapter 5 - The End Of It, score: 0.999999988079071

Average book sentiment: 0.649498617680464

Scores range from 0 (negative) to 1 (positive), with 0.5 being roughly neutral. So we see that once you get past the "About", it is actually a bit of a downer, with the exception of Chapter 3, which is mostly positive (in sentiment). Chapter 5 (The End Of It) is very positive. This is pretty much how I remember that book, so it makes sense.
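Note that the book average printed above also counts the "Title" and "About" entries, which are not really part of the story. As a quick check, the sketch below averages only the five numbered chapters, using the scores copied from the output above. The class name is made up for this example. The result comes out at about 0.61, a little lower than the 0.65 reported for the whole file.

```csharp
using System;
using System.Linq;

static class ChapterAverageSketch
{
    // Scores for entries 2 through 6 of the sample output (Chapters 1 to 5).
    public static readonly double[] ChapterScores =
    {
        0.228464378760411,
        0.561749743918578,
        0.87196253426373,
        0.384314155578613,
        0.999999988079071
    };

    // Plain mean of the chapter scores, leaving out "Title" and "About".
    public static double Average()
    {
        return ChapterScores.Average();
    }
}
```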

Obviously the sentiment of the text and the actual feel and message of the book may not be the same. One could imagine some pretty negative language in a book that is ultimately inspiring and uplifting, and vice versa, but the sentiment analysis provides one type of data point on the sentiment of the book.

Give it a try on some of your favorite books and let me know what you find.