
OCR of Legislative Assembly Video to Identify Member Speaking

A couple of days ago I wrote about my reverse engineering of the video archives of the Legislative Assembly of Prince Edward Island, and I suggested, at the end, that additional hijinks could now ensue.

When I read How I OCR Hundreds of Hours of Video, I knew that’s where I had to look next: the author of that post, Waldo Jaquith, uses optical character recognition — in essence “getting computers to read the words in images” — with video of the General Assembly of Virginia, to do automated indexing of speakers and bills. I reasoned that a similar approach could be used for Prince Edward Island, as our video here also has lower thirds listing the name of the member speaking.

So I tried it. And it worked! Here’s a walk-through of the toolchain I used, which is adapted from Waldo’s.

The structure of the video archive I outlined earlier lends itself well to grabbing a still frame of video every 10 seconds, from the beginning of each 10-second-long transport stream.

I’ll start by illustrating the process of doing OCR on a single frame, and then run through the automation of the process for an entire part of the day.

Each 10-second transport stream has 306 frames. I don’t need all of those, just one, so I use FFmpeg to extract a single JPEG, run against this transport stream file:

ffmpeg -ss 1 -i "media_w1108428848_014.ts" -qscale:v 2 -vframes 1 "media_w1108428848_014.jpg"

The result is a JPEG like this:

[Image: JPEG frame capture from Legislative Assembly video]

I only need the area of the frame that includes the “lower third” to do the OCR, so I use ImageMagick to crop this out:

convert "media_w1108428848_014.jpg" -crop 439x60+64+360 +repage -compress none -depth 8 "media_w1108428848_014.tif"

This crops out a 439 pixel by 60 pixel rectangle starting 64 pixels from the left and 360 pixels from the top, this section here:

[Image: Cropped Video Section]

The lower third is laid out differently for members with multiple titles, like the Premier, than for backbench members, which is why such a large vertical swath is needed to ensure every member’s name can be grabbed.

The resulting TIFF file looks like this:

[Image: Lower Third Cropped Out]

Next I use ImageMagick again to convert the cropped lower third to black and white (the -fx expression builds a grayscale image from a weighted mix of the red and green channels, discarding blue entirely), with:

convert "media_w1108428848_014.tif" -negate -fx '.8*r+.8*g+0*b' -compress none -depth 8 "bw-media_w1108428848_014.tif"

Resulting in black and white images like this:

[Image: Black and white lower third]

Now I’m ready to do the OCR, for which, like Waldo, I use Tesseract:

tesseract "bw-media_w1108428848_014.tif""bw-media_w1108428848_014"

This results in a text file with the converted text:

Hon H Wade Maclauchlan

mmm-‘v
Mun (-1 n1 hI-Anralui l‘nl‘ln nan-Iv

Tesseract did an almost perfect job on the member’s name — Hon. H. Wade MacLauchlan. It missed the periods, but that’s understandable, as they were blown out in the conversion to black and white. And it rendered the fourth letter of the Premier’s last name as a lower-case rather than an upper-case “L”, but, again, the tail of the “L” was blown out by the conversion.

And that’s it, really: grab a frame, crop out the lower third, convert to black and white, OCR. 

All I need now is a script to pull a series of transport streams and do this as a batch; this is what I came up with:

#!/bin/bash

# Date-stamped name of the video, e.g. 20160422A.
DATESTAMP=$1

# Fetch the playlist and pull the session's unique ID out of the
# last chunk filename (media_<uniqueid>_<number>.ts).
curl -Ss http://198.167.125.144:1935/leg/mp4:${DATESTAMP}.mp4/playlist.m3u8 > /tmp/playlist.m3u8
IFS=_ array=(`tail -1 /tmp/playlist.m3u8`)
IFS=. array=(${array[1]})
UNIQUEID="${array[0]}"

# Each chunk is 10 seconds long, so there are 6 chunks per minute:
# $2 is the starting minute, $3 the duration in minutes.
START=$(($2 * 6 - 1))
DURATION=$(($3 * 6))
END=$(($START + $DURATION))

echo "Getting video for ${DATESTAMP}"

while [ ${START} -lt ${END} ]; do
  echo "Getting chunk ${START}"
  # Zero-pad the chunk number so the local files sort correctly.
  PADDED=`printf %03d $START`
  echo "Changing to ${PADDED}"
  # Grab the chunk, extract one frame, crop the lower third,
  # convert to black and white, and OCR it.
  curl -Ss "http://198.167.125.144:1935/leg/mp4:${DATESTAMP}.mp4/media_${UNIQUEID}_${START}.ts" > "ts/media_${UNIQUEID}_${PADDED}.ts"
  ffmpeg -ss 1 -i "ts/media_${UNIQUEID}_${PADDED}.ts" -qscale:v 2 -vframes 1 "frames/media_${UNIQUEID}_${PADDED}.jpg"
  convert "frames/media_${UNIQUEID}_${PADDED}.jpg" -crop 439x60+64+360 +repage -compress none -depth 8 "cropped/media_${UNIQUEID}_${PADDED}.tif"
  convert "cropped/media_${UNIQUEID}_${PADDED}.tif" -negate -fx '.8*r+.8*g+0*b' -compress none -depth 8 "bw/media_${UNIQUEID}_${PADDED}.tif"
  tesseract "bw/media_${UNIQUEID}_${PADDED}.tif" "ocr/media_${UNIQUEID}_${PADDED}"
  let START=START+1
done

With this script in place, and directories set up for each of the generated files — ts/, frames/, cropped/, bw/ and ocr/ — I’m ready to go, using arguments identical to my earlier script’s. So, for example, if I want to OCR 90 minutes of the Legislative Assembly from the morning of April 22, 2016, starting at the second minute, I do this:

./get-video.sh 20160422A 2 90

I leave that running for a while, and I end up with an ocr directory filled with OCRed text from each of the transport streams, files that look like this:

, . 1
Hon J Alan Mclsaac
MHn-Jrl (v0 Axul: HIVIHP thi | l'llr‘HF"

and this:

_ 4'

Hon. Allen F. Roac‘h

As Waldo wrote in his post:

Although Tesseract’s OCR is better than anything else out there, it’s also pretty bad, by any practical measurement.

And that’s borne out in my experiments: the OCR is pretty good, but it’s not consistent enough to use for anything without some post-processing. For that, I used the same technique Waldo did, computing the Levenshtein distance between the text from each OCRed frame and a list of Members of the Legislative Assembly.

From the Members page on the Legislative Assembly website, I prepared a CSV containing a row for each member and their party designation, with a couple of additional rows to allow me to react to frames where no member was identified:

Bradley Trivers,C
Bush Dumville,L
Colin LaVie,C
Darlene Compton,C
Hal Perry,L
Hon. Allen F. Roach,L
Hon. Doug W. Currie,L
Hon. Francis (Buck) Watts,N
Hon. H. Wade MacLauchlan,L
Hon. Heath MacDonald,L
Hon. J. Alan McIsaac,L
Hon. Jamie Fox,C
Hon. Paula Biggar,L
Hon. Richard Brown,L
Hon. Robert L. Henderson,L
Hon. Robert Mitchell,L
Hon. Tina Mundy,L
James Aylward,C
Janice Sherry,L
Jordan Brown,L
Kathleen Casey,L
Matthew MacKay,C
Pat Murphy,L
Peter Bevan-Baker,G
Sidney MacEwen,C
Sonny Gallant,L
Steven Myers,C
None,N
2nd Session,N

The idea is that for each OCRed frame I take the text and compare it to each of the names on this list; the name on the list with the lowest Levenshtein distance value is the likeliest speaker. 
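
That matching step, in isolation, looks something like this minimal PHP sketch (the full script appears below; the $candidates array here stands in for the names from the CSV):

<?php

// Given one frame's OCRed text, return the candidate name with the
// smallest Levenshtein distance, i.e. the likeliest speaker.
function likeliest_speaker($ocr, $candidates) {
  $best = '';
  $mindist = PHP_INT_MAX;
  foreach ($candidates as $name) {
    $d = levenshtein($name, trim($ocr));
    if ($d < $mindist) {
      $mindist = $d;
      $best = $name;
    }
  }
  return $best;
}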

For example, for this OCRed text:

e'b

Hon. Paula Blggav

I get this set of Levenshtein distances:

Bradley Trivers -> 20
Bush Dumville -> 18
Colin LaVie -> 18
Darlene Compton -> 20
Hal Perry -> 17
Hon. Allen F. Roach -> 18
Hon. Doug W. Currie -> 18
Hon. Francis (Buck) Watts -> 22
Hon. H. Wade MacLauchlan -> 19
Hon. Heath MacDonald -> 19
Hon. J. Alan McIsaac -> 17
Hon. Jamie Fox -> 16
Hon. Paula Biggar -> 8
Hon. Richard Brown -> 17
Hon. Robert L. Henderson -> 22
Hon. Robert Mitchell -> 20
Hon. Tina Mundy -> 16
James Aylward -> 19
Janice Sherry -> 19
Jordan Brown -> 17
Kathleen Casey -> 18
Matthew MacKay -> 18
Pat Murphy -> 18
Peter Bevan-Baker -> 19
Sidney MacEwen -> 19
Sonny Gallant -> 17
Steven Myers -> 19
None -> 19
2nd Session -> 20

The smallest Levenshtein distance is for Hon. Paula Biggar, with a value of 8, so that’s the name I associate with this frame.

Ninety minutes of video from Friday morning results in 540 frame captures and 540 OCRed snippets of text.

With the snippets of text extracted, I run a PHP script on the result, dumping out an HTML file with a thumbnail for each frame, coloured to match the party of the member identified from the OCR:

<?php

// Map each party designation from the CSV to a background colour.
$colors = array("L" => "#F00",  // Liberal
                "C" => "#00F",  // Conservative
                "G" => "#0F0",  // Green
                "N" => "#FFF"   // None
                );

// Load the member-name CSV into an array of name/party pairs.
$names = file_get_contents("member-names.txt");
$members = explode("\n", $names);
foreach ($members as $key => $value) {
  if ($value != '') {
    list($name, $party) = explode(",", $value);
    $m[] = array("name" => $name, "party" => $party);
  }
}

$fp = fopen("index.html", "w");

if ($handle = opendir('./ocr')) {
  while (false !== ($entry = readdir($handle))) {
      if ($entry != "." && $entry != ".." && $entry != '.DS_Store') {
        $ocr = file_get_contents("./ocr/" . $entry);
        $jpeg = "frames/" . basename($entry, ".txt") . ".jpg";
        $ts = "ts/" . basename($entry, ".txt") . ".ts";
        // Strip everything but letters and newlines from the OCRed text.
        $ocr = preg_replace('/[^a-z\n]+/i', '', $ocr);
        // Find the member name with the smallest Levenshtein distance.
        $mindist = 9999;
        unset($found);
        foreach ($m as $key => $value) {
          $d = levenshtein($value['name'], trim($ocr));
          if ($d < $mindist) {
            $mindist = $d;
            $found = $value;
          }
        }
        // Emit a thumbnail, coloured by party, linked to the transport stream.
        fwrite($fp, "<div style='float: left; background: " . $colors[$found['party']] . "'>\n");
        fwrite($fp, "<a href='$ts'><img src='$jpeg' style='width: 64px; height: auto; padding: 5px'></a></div>");
      }
  }
  closedir($handle);
}

fclose($fp);

The resulting HTML file looks like this in a browser:

[Image: Friday Morning in the House, colour-coded]

The frames that are coloured white are frames where there was either no lower third, or where the lower third didn’t contain the name of the member speaking. It’s not a perfect process: the last dozen or so frames, for example, are from the consideration of the estimates, where there’s no member’s name in the lower third; my script doesn’t know that, and simply finds the member’s name with the smallest Levenshtein distance from the jumble of text it does find there. Some fine-tuning of the matching process could avoid this, for example by rejecting matches whose best distance is still large, as sketched below.
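
One possible refinement, offered here as a sketch rather than anything the script above does: treat a frame as having no identified speaker when even the best Levenshtein distance is large relative to the length of the matched name.

<?php

// Sketch: threshold-based matching. The 0.5 ratio is a guess and would
// need tuning against real frames; it is not a value from the script above.
function match_with_threshold($ocr, $candidates, $max_ratio = 0.5) {
  $best = null;
  $mindist = PHP_INT_MAX;
  foreach ($candidates as $name) {
    $d = levenshtein($name, trim($ocr));
    if ($d < $mindist) {
      $mindist = $d;
      $best = $name;
    }
  }
  if ($best === null || $mindist > $max_ratio * strlen($best)) {
    return null;  // too far from every known name; probably not a lower third
  }
  return $best;
}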

By changing the output of the PHP script so that the names of the members are included, the thumbnails are a little larger, and each thumbnail links to the transport stream of the associated video, I get a visual navigator for the morning’s video:

[Image: visual navigator for the morning’s video]

One more experiment, this time representing each OCRed frame as a two-pixel-wide part of a bar, allowing the entire morning to be visualized by party:

[Image: The Morning Visualized]
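
A bar like that takes only a few lines to emit; here’s an illustrative sketch, assuming a $parties array holding one party code per frame, in chronological order, and the $colors map from the script above:

<?php

// Sketch: render each frame's party as a two-pixel-wide slice of one bar.
function render_bar($parties, $colors) {
  $html = "<div style='height: 20px'>\n";
  foreach ($parties as $party) {
    $html .= "<div style='float: left; width: 2px; height: 20px; background: " . $colors[$party] . "'></div>\n";
  }
  return $html . "</div>\n";
}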

Leaving thumbnails and party colours out of it completely, here are the members ranked by the number of the 540 frame captures (one per 10-second chunk) in which they were identified as the speaker (the counts don’t sum to 540 because the remaining frames had no lower third and thus no identified speaker; a sketch of the tally follows the list):

  42 Hon. Paula Biggar
  37 Peter Bevan-Baker
  36 James Aylward
  30 Hon. Robert L. Henderson
  28 Hon. Allen F. Roach
  19 Hon. J. Alan McIsaac
  17 Hon. Jamie Fox
  15 Steven Myers
  15 Bradley Trivers
  13 Sidney MacEwen
  13 Hon. H. Wade MacLauchlan
  12 Hal Perry
  11 Colin LaVie
   9 Hon. Doug W. Currie
   8 Hon. Robert Mitchell
   8 Hon. Heath MacDonald
   6 Hon. Tina Mundy
   5 Darlene Compton
   5 Bush Dumville
   4 Sonny Gallant
   4 Jordan Brown
   3 Kathleen Casey
   2 Hon. Richard Brown
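
The ranking itself is just a tally of the per-frame matches; a minimal sketch of producing it, assuming a $speakers array with one matched name per identified frame:

<?php

// Sketch: count the frames each member was identified in, then print
// the counts in descending order, mirroring the list above.
$counts = array();
foreach ($speakers as $name) {
  if ($name != 'None' && $name != '2nd Session') {
    $counts[$name] = isset($counts[$name]) ? $counts[$name] + 1 : 1;
  }
}
arsort($counts);
foreach ($counts as $name => $count) {
  printf("%4d %s\n", $count, $name);
}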

Visualized as a bar chart, this data looks like this:

[Image: Bar Chart of Frames per Member]

And finally, here’s a party breakdown (it’s important to note that this is only a very rough take on the “which party gets the most speaking time” question because I’m only looking at the first frame of every 10-second video chunk):

[Image: Pie Chart Showing Frames per Party]

Peter Bevan-Baker, Leader of the Green Party, is the only speaker in the Green slice; he’s the second-most-frequent speaker, at 37 frame chunks, but the other parties spread their speaking across more members, which is why the Green Party represents only 11% of the frame chunks in total.

As with much of the information that public bodies emit, the Legislative Assembly of PEI could make this sort of analysis much easier by releasing time-coded open data in addition to the video: a sort of “structured data Hansard,” if you will. Without that, we’re left using blunt instruments like OCR which, though fun, involve a lot of futzing that shouldn’t really be required.

