Really, one of the loudest, most ear-piercing things I ever heard. Think about it.
Sorry, I just wanted to tell that little bit o' story. But that's where the name comes from. Mynas are generally pretty good at interpreting and repeating back what they hear. So I got a name.

First, it requires this bit: the Linguistic Information Sound Editor. It's a handy-dandy, freely available chunk of the Agent Developers Toolkit, and it's cool beans, I tell ya. I'm not just lazy; I'm sure I couldn't have done the work otherwise, so it's easy to lean on the Redmond folks. The program lets you bring in a .WAV file and decipher its "Linguistic Information" (all the phonemes in there) automatically. Well, you do have to type the text into the LISE. Then you save the file as a .LWV file (a "Linguistically" enhanced WaV). It'll still read fine as a .WAV in pretty much any editor; it just has some extra data tacked on the end. If you want, you can usually (haven't found an exception yet) rename the file back to .WAV. And it does a bang-up job of detecting all the proper phonemes without any input at all (except the typing part).
PREPPING THE AUDIO FILE
As with just about every automatic phoneme detection scheme, you first need to type in the text of what the character is saying as a guide for the detection. (I've had some success entering rough phonetic spellings for non-verbal sounds. Also, if the person drops their h's or ending g's, try spelling it like that; it can help.) Then you need to get the editor to do the detection: either use the Edit>Generate Linguistic Info command, or the icon with the little yellow lightning bolt on it. A little progress bar pops up, and in a few seconds you should see yellow bars, with the words above and the phonemes below, displayed in the window that shows the waveform. You can then play back the wave and check that it's working about right by watching the mouth on the right side of the editor. If something doesn't look right, you can either move the phonemes around by click-dragging their edges, or right-click on them to change the phoneme entirely. After you've done those couple of things, save the LWV.
It does a better job if the files aren't too long. If the files are really long, or particularly noisy or muffled, it may not be able to detect the phonemes at all. Just keep that in mind.
Here's a sample (and here) of what it does with one minute of work: a mouth switch imported from the Moho>Object Library>mouths.moho file, tweaked to use the script's accepted names, plus a file run through the LISE untouched (the Reagan sound file from the tutorials).
HOW TO WORK WITH MYNA
What did I do? I started thinking about this almost three years ago, but Moho didn't have any scripting then. I tried to do it in Visual Basic and failed to get as far as I'd have liked. Now, after many hours in front of a hot hex editor, I've nailed down the structure of a file format that isn't documented anywhere.
So here's what the script does. Choose a switch layer whose sublayers are named properly. Run the script and choose a properly encoded audio file you've already run through the Linguistic Information Sound Editor (generally a .LWV file, but you can rename it back to .WAV without any problems I've found). The script will tell you if there's anything wrong with it. It'll check the file, see whether it's stereo or mono, how many bits per sample, what the sample rate is, and how many frames per second Moho is running at, and just drop those switch keyframes in there.
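For the curious, those header checks amount to reading the standard RIFF/WAV "fmt " chunk. Here's a rough sketch in Python (Myna itself is Lua; the function name is mine, and this only illustrates the standard WAV layout, not Myna's actual code):

```python
import struct

def read_wav_format(data):
    """Parse the 'fmt ' chunk of a RIFF/WAV byte string.

    Returns channel count, sample rate, bits per sample, and whether
    the audio is uncompressed PCM (format code 1).
    """
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (chunk_size,) = struct.unpack_from("<I", data, pos + 4)
        if chunk_id == b"fmt ":
            fmt_code, channels, rate, _byte_rate, _align, bits = \
                struct.unpack_from("<HHIIHH", data, pos + 8)
            return {
                "is_pcm": fmt_code == 1,   # 1 = uncompressed PCM
                "channels": channels,      # 1 = mono, 2 = stereo
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        pos += 8 + chunk_size + (chunk_size & 1)  # chunks are word-aligned
    raise ValueError("no fmt chunk found")
```

The extra linguistic data the LISE appends lives in chunks after the audio, which is why the file still reads fine as a plain .WAV: a well-behaved reader just skips chunk IDs it doesn't recognize.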
I love the sound of "totally automated".
There are a few dialogs that will pop up as you go.
1. The first is, of course, the one asking you which file you want to use.
2. Whether or not you would like a full report of every phoneme being imported sent to the Lua Console window. Not something you will often need, but helpful if something is screwy.
3. Whether you would like to Export a switch data file (which will prompt you for the file you would like to export to), just Add the keyframes to the project, or do Both. If you export a switch data file and then decide to change the Frame Rate of your project, just re-export it from the LWV file; Myna automatically compensates for the new Frame Rate.
4. If you'd like to skip any of the keyframes. Sometimes you can end up with an awful lot of keyframes, which can make the mouth movement a little too busy. You can always cull them by hand, but I've tried to automate it for you a bit; see if it works for you. The keyframes can be set to FULL (every last one of them), TWOS (every other keyframe), or THREES (every third keyframe). It automatically detects if, because of this culling, you would now have the same keyframe twice in a row, and won't add the second one. Also, "rest" keyframes are never skipped. This isn't a very intelligent method, but improvement is for another day.
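The frame-rate compensation in item 3 boils down to storing the phoneme cues in seconds and only converting to frame numbers at import time. A minimal sketch (Python for illustration; Myna itself is a Lua script, and the function name is made up):

```python
def phonemes_to_frames(phonemes, fps):
    """Map (start_seconds, layer_name) cues to Moho frame numbers.

    Because the cue times are in seconds, re-running the conversion
    with a different fps value automatically lands the same mouth
    shapes on the right frames at the new project frame rate.
    """
    return [(int(round(t * fps)), name) for t, name in phonemes]
```

So a cue at 0.25 seconds becomes frame 6 at 24 fps and frame 3 at 12 fps, which is all "automatically compensates" needs to mean here.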
I'll eventually put all this into a single dialog, but I just didn't have the time.
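The FULL/TWOS/THREES culling from item 4 can be sketched roughly like this (Python for illustration; Myna itself is Lua, and the exact skip/dedup ordering is my reading of the description, not the script's actual logic):

```python
def cull_keyframes(keys, step):
    """Thin a list of (frame, mouth_name) keyframes.

    step = 1 (FULL), 2 (TWOS), 3 (THREES). "rest" keyframes are
    always kept, and a keyframe identical to the previously kept
    one is dropped rather than added twice in a row.
    """
    kept = []
    for i, (frame, name) in enumerate(keys):
        if name != "rest" and i % step != 0:
            continue  # skipped by the TWOS/THREES cadence
        if kept and kept[-1][1] == name:
            continue  # same mouth twice in a row: drop the second
        kept.append((frame, name))
    return kept
```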
SETTING UP YOUR SWITCH LAYER
It's all in the naming of the sublayers in the switch layer. The naming is based on the Preston Blair phoneme set (there's a good description of what the mouth shapes should look like at http://www.garycmartin.com/mouth_shapes.html), with the addition of a separate shape for "th". The Lost Marble mouths are set up with sublayers for every letter of the alphabet, but the actual drawings are mostly duplicates and roughly follow this convention. So what I did in this case was find the first existing letter that belonged to one of the phoneme groups and rename it, e.g. I renamed the "A" layer to "ai", the "B" layer to "mbp", "Closed" to "rest", etc. Then I deleted the layers I hadn't renamed. You can use any set of mouths you like, as long as the layers are named correctly. The ones you need are:
ai
e
mbp
cdgjknrsyz --might just rename this "other"
fv
l
th
o
u
wq
rest -- the "closed" shape
If you use these for your sublayer names, it's all a go. (Note: it's important that the letters be lower case and typed exactly as you see them, or it won't work properly.) I intend to add more flexibility as I go, so you can set things up as you please, but for now this is a common convention that should get the job done in most circumstances.
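If you want to sanity-check a mouth setup against that list before running the script, the idea is just a case-sensitive set comparison. A quick illustration in Python (the required names come straight from the list above; the helper itself is hypothetical, not part of Myna):

```python
# The eleven sublayer names Myna expects, exactly as listed (lower case).
REQUIRED_MOUTHS = {"ai", "e", "mbp", "cdgjknrsyz", "fv",
                   "l", "th", "o", "u", "wq", "rest"}

def missing_mouths(sublayer_names):
    """Return a sorted list of required mouth layers that are absent.

    The comparison is case-sensitive, matching the note above: "AI"
    will NOT count as "ai".
    """
    return sorted(REQUIRED_MOUTHS - set(sublayer_names))
```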
The raw phonemes that the LISE puts in there are based on the IPA (International Phonetic Association) designations. Potentially, if you really wanted to, you could make up a switch layer with sublayers named after all of the IPA phonemes the editor properly recognizes (close to 90 of them), if you really needed every sound exactly proper to lip, teeth, and tongue position. It might make sense in hyper-real 3D character work, but it seems like overkill for most cartoons. I'll set it up so you can do that in the future, but this is a first pass at it.
THE FILE
This is GPL. Play with it. I'm proud of this little bit o' work. I don't think there's ever been a freely available solution to automate lipsync chores. Have at it. Anyone have a 6 or 7 year old to try it out with? If you find any bugs, let me know. Have fun. Sorry it's only Windows; I can only do what I can do. EDIT: If someone with a Mac could try processing a WAV file in Windows or Linux and see whether Myna can read the file on a Mac, I'd appreciate it. I might have to tweak the byte order for Macs, but that wouldn't be very hard.
***UPDATED 4/17/05 1:10am GMT***
--added ability to add keyframes starting from current frame
Download the MYNA package here.
Bandwidth limit insurance copy
The .ZIP includes the sf_myna.lua file, which goes somewhere in the Moho>Scripts>Menu directory (I have it in the Sound directory). The library (sf_utilities.lua; it's the latest version, with no conflict with previous versions, but Myna needs this one) goes in the Moho>Scripts>Utilities directory. The MynaMouths.moho file has all the mouths, and they are all set up now. Set up some different sound files and try it out.
TIPS
-- First, this only works for uncompressed files. No MP3s (including MP3s with .WAV headers and .WAV extensions), no ADPCM. But any flavor of uncompressed .WAV should be fine. If the original .WAV is compressed, just make sure that when you save the LWV the Format is set to PCM.
-- Some files that have been heavily compressed (in the audio-level-control sense, not file compression as above) seem to give the LISE a problem. The same goes for files with a lot of clipping (meaning the recording level was far too high). Good clean sound helps the tracking a lot, but there's some tolerance.