2010-03-24

Guest post: Pulling audio from YouTube with PHP and ffmpeg

Editorial note:
Patrick Teague is a colleague of mine from my first job in the mid-1990s. When I heard of the project he was working on, I asked if he wanted to share it here, and he obliged.

I got this working instantly on OpenBSD. All I needed to do was:
sudo pkg_add pear php5-curl ffmpeg (and then copy the sample php5-curl config into place)
sudo pear install console_commandline

I also noticed that I occasionally needed to run the script twice: the first run set the proper cookies for curl, and it worked after that.

Enjoy!


The attached script requires the following. I've noted the Ubuntu package names, which should match Debian's; other Linux and BSD distros should be similar. On Windows, the MSI installer should just be a simple matter of ticking off the right options.

  • PHP 5.2+ CLI (php5-cli) with cURL installed (php5-curl)
  • PEAR's (php-pear) Console/CommandLine
  • ffmpeg binary should be in your environment $PATH and should have access to libmp3lame

I have no idea why people enjoy listening to music as videos with a static image, but YouTube is full of these, as well as music videos and people playing their own remixes or original music. If I hear something I like, I then try to get it from eMusic, iTunes, directly from the artist at a show, or somewhere else. Sometimes the music just isn't available yet (no official CD/mp3 release), or the tunes are available directly from the musician's website but there's a better/live remix on YouTube...

I found myself listening to something the other night and I was able to find plenty of tracks from the artist on eMusic and iTunes, but not that particular track. In one of the comments somebody asked where they could get a copy of the track, and a reply mentioned mp3ify. I googled it and found http://www.mp3ify.com/, which I initially used to grab the mp3. After I had used it 2 or 3 times it screamed at me for not making a donation, so I figured I should be able to work out how they did it and stop wasting somebody else's bandwidth and processing time. Not to mention that the more control I have over the process, the more control I have over my results.

I have FlashGot installed and it allows me to download the flv (Flash video) files on YouTube pages, but it wasn't letting me see what the URL was, and the only flv that I could easily find in the HTML code was the player. FlashGot does allow you to add downloaders and pass them various options including the URL, cookie data, referer URL, etc. I wrote a quick shell script that would dump this information to a text file and used it in my various attempts to grab the flv from YouTube.

Note: I must have really needed sleep, because I completely forgot that I could have had an easier time grabbing the URL and cookie data from Firebug.

I haven't done extensive testing, but the YouTube CDN seems to require a combination of several settings. I found that I couldn't grab the flv without certain cookie data, a referer URL, a "valid" User-Agent string, and a particular URL. It may not actually care about the User-Agent, since I was rushing my testing, but I couldn't get wget or curl to download the flv without setting a Firefox UA string. It seems to want certain cookie data, but I'm not sure how much of the cookie data is relevant vs the generated timestamp in the flv's URL. Again, I was rushing this.
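As a rough illustration of that combination, here is a minimal sketch of a request carrying a cookie, a referer, and a Firefox-style User-Agent. It is written in Python rather than the script's PHP/cURL. The function name, the URLs, the cookie value, and the exact UA string are all placeholders invented for the example, not values taken from the script.

```python
import urllib.request


def build_flv_request(flv_url, referer, cookie_header, user_agent):
    """Build (but do not send) a request with the three headers described above."""
    return urllib.request.Request(
        flv_url,
        headers={
            "User-Agent": user_agent,   # wget/curl defaults did not seem to work
            "Referer": referer,         # the watch page the flv URL came from
            "Cookie": cookie_header,    # cookie data captured from the browser
        },
    )


# Placeholder values only (assumptions for illustration).
request = build_flv_request(
    "http://cdn.example.invalid/videoplayback?id=PLACEHOLDER",
    "http://www.youtube.com/watch?v=PLACEHOLDER",
    "NAME=PLACEHOLDER",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6",
)

# Sending it would be: urllib.request.urlopen(request)
```

Nothing is sent until urlopen() is called, so the headers can be inspected or logged first, which helps when working out which of them the server really needs.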

From what I can tell, YouTube creates a JavaScript object named yt and adds settings to it via the yt.setConfig() function. yt.setConfig() can accept single key/value pairs [e.g. yt.setConfig('LOGGED_IN', true);] or large JSON objects with several key/value pairs. In one of these large objects the value for VIDEO_TITLE is set (needed for modifying the URL), and the flv URL is listed within the SWF_ARGS key, which is a property list; the particular value I found the URL in is fmt_url_map. The value for fmt_url_map is URL encoded, with a pipe separating 3 values. The 1st value is a number, the 2nd value is the URL that I used for the flv, and the 3rd value is another URL (I'm not sure what this URL is for). The 2nd value is mostly correct, but can't be used without being modified. At the very end of the URL is a comma followed by a couple of characters; these need to be trimmed off the end and replaced with &#/ + a modified version of the VIDEO_TITLE + _video.flv.
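The decoding steps just described (URL-decode, split on the pipe, take the 2nd value, trim the trailing comma and what follows it) can be sketched roughly as below. This is Python rather than the script's PHP, the function name is made up, and the sample value is invented to match the description rather than captured from a real page. The final step of appending the title-based suffix is left out.

```python
from urllib.parse import unquote


def flv_url_from_fmt_url_map(encoded_value):
    """Return the 2nd pipe-separated value with its trailing ',xx' removed."""
    decoded = unquote(encoded_value)
    parts = decoded.split("|")
    if len(parts) < 2:
        raise ValueError("expected at least two pipe-separated values")
    url = parts[1]
    # Drop the last comma and the few characters that follow it, if present.
    head, comma, _tail = url.rpartition(",")
    return head if comma else url


# Invented sample: number | URL ending in ",5" | another URL
sample = (
    "34%7Chttp%3A%2F%2Fcdn.example.invalid%2Fvideoplayback%3Fid%3Dabc%2C5"
    "%7Chttp%3A%2F%2Fother.example.invalid%2F"
)
print(flv_url_from_fmt_url_map(sample))
```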

I've only seen a limited set of translated characters in the modified VIDEO_TITLE. So far I've only seen space ( ), dash (-), period (.), and ampersand (&) converted to an underscore (_), and square brackets (both [ and ]) removed. I'm guessing there's a larger set of translated characters, but I'm not sure what they are... I'm also not sure how important it is to get these correct, as at least one of my attempts had an incorrect name but worked anyway (I would prefer to use a correct name, just so it doesn't raise any red flags).
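The character rules observed so far are small enough to write down as a translation table. This is a sketch in Python rather than the script's PHP, the function name is made up, and it covers only the characters listed above. Anything else passes through unchanged, which may not match what YouTube actually does.

```python
# Observed so far: space, dash, period, ampersand -> underscore; [ and ] removed.
TITLE_TABLE = str.maketrans({
    " ": "_",
    "-": "_",
    ".": "_",
    "&": "_",
    "[": None,
    "]": None,
})


def modified_video_title(video_title):
    """Apply only the character translations observed in the post."""
    return video_title.translate(TITLE_TABLE)


print(modified_video_title("Artist - Song [Live] feat. A&B"))
# prints: Artist___Song_Live_feat__A_B
```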

Honestly, I'd prefer to run the HTML page through some shell utility (possibly Rhino) that could process JavaScript and output the variables needed, so I could let YouTube's code do its own work. That would also make it more forward compatible if YouTube decided to change how they structure their URLs or something. In the meantime I used a combination of grep and sed to initially pull out the values I needed (exchanged for preg_grep() and preg_replace() in the PHP code).
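For a feel of the pattern-matching approach, here is a rough Python sketch (the script itself uses preg_grep() and preg_replace() in PHP). The page fragment is invented to match the description of yt.setConfig() above and is not a real capture. The regular expressions and the function name are likewise assumptions. Titles containing escaped quotes would not be handled.

```python
import re

# Invented fragment shaped like the description of yt.setConfig(); not real page source.
SAMPLE_PAGE = """
<script>
yt.setConfig('LOGGED_IN', false);
yt.setConfig({'VIDEO_TITLE': 'Artist - Song [Live]', 'SWF_ARGS': {"fmt_url_map": "34%7Chttp%3A%2F%2Fcdn.example.invalid%2Fv%2C5%7Chttp%3A%2F%2Fother.example.invalid%2F"}});
</script>
"""

# Match key: 'value' or key: "value"; the backreference pairs up the quotes.
TITLE_RE = re.compile(r"""['"]VIDEO_TITLE['"]\s*:\s*(['"])(.*?)\1""")
FMT_RE = re.compile(r"""['"]fmt_url_map['"]\s*:\s*(['"])(.*?)\1""")


def extract_config_values(html):
    """Return (VIDEO_TITLE, fmt_url_map), with None for anything not found."""
    title = TITLE_RE.search(html)
    fmt = FMT_RE.search(html)
    return (
        title.group(2) if title else None,
        fmt.group(2) if fmt else None,
    )


video_title, fmt_url_map = extract_config_values(SAMPLE_PAGE)
print(video_title)
print(fmt_url_map)
```

This is brittle in the way the paragraph above worries about: any change to how the page lays out these values breaks the patterns.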

The final part was to convert the flv to a different format. I've used ffmpeg in the past for converting between media formats, but this was the first time I've used it to convert an flv. There are several nifty things that ffmpeg can do: merge raw video files with audio files, transcode decrypted VOBs into a video+audio format, audio and/or video conversions (e.g. wav into mp3), etc. I googled for "ffmpeg convert flv to mp3" to get some quick solutions. One of the solutions I found suggested using -acodec mp3, but I couldn't get that to work. Initially I dropped back to mp2, but later discovered I needed to use -acodec libmp3lame instead. I should probably also state that -acodec copy is valid, but could cause problems if the embedded audio is in a format that isn't acceptable in an mp3 wrapper.

The other ffmpeg switches I'm using are -ac 2 (setting the audio channels to 2), -ab 128k (the audio bitrate), -vn (disable video recording, since the video isn't needed for an mp3), and -y (overwrite output files). There are many other switches available, and I've thought about adding the following (possibly via some switches set up in Console/CommandLine): -title string, -author string, -copyright string, -comment string, -album string, -track number, and -year number, as these should populate the fields in the id3 tag.
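Put together, the switches mentioned so far give a command along these lines. The sketch below only builds and prints the argument list, in Python rather than the script's PHP. The function name and file names are placeholders, and the order of switches in the actual script may differ.

```python
import shlex


def build_ffmpeg_command(flv_path, mp3_path):
    """Assemble the ffmpeg arguments described in the post (nothing is executed)."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", flv_path,           # input flv
        "-vn",                    # no video in the output
        "-ac", "2",               # two audio channels
        "-ab", "128k",            # audio bitrate
        "-acodec", "libmp3lame",  # encode with LAME rather than "mp3"
        mp3_path,
    ]


cmd = build_ffmpeg_command("Some_Song_video.flv", "Some_Song.mp3")
print(shlex.join(cmd))

# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a title contains spaces or ampersands.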

The final bit is more of a pet peeve than anything that's really needed. My preference is to set the date on files to match what's on the server; otherwise, how do you know if it's been modified? If you don't want or need this, feel free to comment out the if statement surrounding the curl_getinfo( $ch, CURLINFO_FILETIME ) as well as the if ( !touch( $file_mp3, $GLOBALS['server_filetime'] ) ) { section at the very end, right after the passthru( $cmd );.
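The same idea, stamping the local file with the server's timestamp, looks roughly like this in Python (the script does it with touch() in PHP). The function name and the timestamp are made up for the example. The check for a negative value reflects cURL reporting -1 when the remote time is unknown.

```python
import os
import tempfile


def apply_server_mtime(path, server_filetime):
    """Set the file's access and modification times to the server's timestamp.

    Returns False without touching the file when the timestamp is unknown.
    """
    if server_filetime is None or server_filetime < 0:
        return False
    os.utime(path, (server_filetime, server_filetime))
    return True


# Demo on a throwaway file with an arbitrary example timestamp.
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as handle:
    demo_path = handle.name
apply_server_mtime(demo_path, 1269388800)
print(int(os.path.getmtime(demo_path)))
os.remove(demo_path)
```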

Download:

get-youtube (syntax highlighted)

get-youtube (plain text source code)
