Thursday, April 3, 2008

Going down on Lexi






Downloading a website.

One solo site named 'All over Lexi' uses PHP for its main, CGI-style URLs. Downloading CGI sites wholesale can sometimes give you a lot of unwanted stuff.

The site presents 4 pages with picture sets and 1 page with some videos. As there are way more pictures than videos, the pictures get precedence. Browse around and look at some images to see whether the webmaster is on drugs or clean. This site has a clean webmaster, which is good.

Looking at a picture and then right-clicking it to see the URL will give you something like:

http://www.site.com/members/pictures/hallway_lg/lexi-lg-154.jpg


Check more pictures to see whether this layout holds. With this site it does. All images follow this format:

http://www.site.com/members/pictures/pinkoncouch_lg/lexi-lg-001.jpg
http://www.site.com/members/pictures/hallway_lg/lexi-lg-154.jpg


So only the set names and the numbering differ. First we have to get the set names. Go back to page 1 and save it to your HDD, then do the same with pages 2, 3 and 4. With those saved, have a go at them with GREP. A regular expression like "(?<=href=\").*?_lg" does the job nicely and renders the list of set names.
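The set-name extraction can be sketched on the command line roughly like this. The sample page, its link layout and the file name p1.htm are assumptions made up for illustration; the real pages may link to the sets differently. It also swaps the lookbehind for a plain extended regex plus sed, so it does not depend on a Perl-compatible grep.

```shell
# Hypothetical sample standing in for a saved index page (p1.htm).
workdir=$(mktemp -d)
cat > "$workdir/p1.htm" <<'EOF'
<a href="index.php?page=2">next</a>
<a href="pictures/hallway_lg/index.php">Hallway</a>
<a href="pictures/pinkoncouch_lg/index.php">Pink on couch</a>
<a href="pictures/hallway_lg/index.php"><img src="thumbs/hallway.jpg"></a>
EOF

# Keep only hrefs that reach a "_lg" directory, strip the attribute prefix
# and any leading path, then de-duplicate.
set_names=$(grep -ohE 'href="[^"]*_lg' "$workdir"/p*.htm \
  | sed -e 's/^href="//' -e 's|.*/||' \
  | sort -u)

printf '%s\n' "$set_names"
rm -rf "$workdir"
```

Using [^"]* instead of .*? keeps each match inside a single href attribute, which matters once the pages have been flattened to one long line.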


Mark it and copy it to Notepad, UltraEdit, EditPro or whatever editor you use, preferably one that handles column mode. Next we need to find out how many pictures to download from each set. Luckily the drug-free webmaster wrote this on the pages. Checking the HTML shows that some newline characters will get in the way of matching. Remove all tabs, newlines etc. from the 4 htm files and save them again. You can do that by running:

tr -d "\n\r\t" < p1.htm >p1.html
tr -d "\n\r\t" < p2.htm >p2.html
tr -d "\n\r\t" < p3.htm >p3.html
tr -d "\n\r\t" < p4.htm >p4.html
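If there are more than a handful of pages, the same four commands can be written as a loop. This is only a sketch: the sample pages written to a temporary directory are invented stand-ins for the saved p1.htm to p4.htm.

```shell
workdir=$(mktemp -d)
# Hypothetical stand-ins for the four saved pages p1.htm .. p4.htm.
for n in 1 2 3 4; do
  printf '<td>\r\n\t15%s\r\n pictures</td>\r\n' "$n" > "$workdir/p$n.htm"
done

# Strip newlines, carriage returns and tabs from every saved page,
# writing pN.html next to pN.htm.
for f in "$workdir"/p*.htm; do
  tr -d '\n\r\t' < "$f" > "${f%.htm}.html"
done

flattened=$(cat "$workdir/p1.html")
printf '%s\n' "$flattened"
rm -rf "$workdir"
```

The pattern p*.htm does not match the freshly written .html files, so the loop never picks up its own output.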

The files with the full .html extension are now newline-free. With those 4 new files we run a GREP to extract the number of images to download:

grep -Poh "[0-9]+ pictures" p*.html
This will give you a new list. Mark the important part (the numbers) and paste it into your editor COLUMN WISE.
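To skip the manual marking, the numbers can be cut out directly. A small sketch, assuming the flattened pages contain text like "154 pictures"; the sample files and their markup are made up for illustration.

```shell
workdir=$(mktemp -d)
# Hypothetical flattened pages; the real markup will differ.
printf '<td>Hallway</td><td>154 pictures</td><td>Pink on couch</td><td>87 pictures</td>' > "$workdir/p1.html"
printf '<td>Bedroom</td><td>60 pictures</td>' > "$workdir/p2.html"

# -o prints each match on its own line, -h drops the file name prefix;
# cut then keeps only the number in front of the space.
counts=$(grep -ohE '[0-9]+ pictures' "$workdir"/p*.html | cut -d' ' -f1)

printf '%s\n' "$counts"
rm -rf "$workdir"
```

The counts come out in page order, so they line up with the set names only if those were extracted in the same order.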


You then have to write a few things in column mode in your editor to complete the list. Once done, search and replace ' ' (space) with nothing, which concatenates the text into a proper list of URLs with macros (see the article MACRO downloading).
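If your downloader does not understand macros, the list can also be expanded into plain URLs. This is a different route from the column-mode one described here, sketched under the assumption that every set is numbered from 001 upward with three digits, as in the example URLs. The counts in the sample pairs are invented and kept tiny.

```shell
base='http://www.site.com/members/pictures'

# Hypothetical set names and picture counts, one "name count" pair per line,
# as they would look after pasting the two columns side by side.
pairs='hallway_lg 3
pinkoncouch_lg 2'

urls=$(printf '%s\n' "$pairs" | while read -r set count; do
  i=1
  while [ "$i" -le "$count" ]; do
    # File numbers are zero-padded to three digits (lexi-lg-001.jpg).
    printf '%s/%s/lexi-lg-%03d.jpg\n' "$base" "$set" "$i"
    i=$((i + 1))
  done
done)

printf '%s\n' "$urls"
```

With the real counts in place of 3 and 2, the output is one URL per picture, ready to paste into a download list.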


Copy your download list to a new project in OEEE and insert your username and password so that each URL begins like http://yourusername:yourpassword@www.
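Adding the login to a whole list by hand gets tedious, so here is a sketch of doing it with sed. The user and pass values are placeholders, and the sketch assumes every URL starts with http://.

```shell
# Placeholder credentials; substitute your own.
user='yourusername'
pass='yourpassword'

url='http://www.site.com/members/pictures/hallway_lg/lexi-lg-154.jpg'

# Insert "user:pass@" right after the scheme.
with_auth=$(printf '%s\n' "$url" | sed "s|^http://|http://$user:$pass@|")

printf '%s\n' "$with_auth"
```

A password containing characters such as |, &, @ or : would need escaping for sed and percent-encoding in the URL, which this sketch does not handle.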

No need for cookies or referers. Hit Download and let it run. Download the videos manually; there are only 5. Handle candids, wallpapers and webcams as you see fit.
