Maintainers Guild (MG): April 2008

Tuesday, April 29, 2008

Beginners RegEx - part 4 - repeats and boundaries

This is an addition to the former article about metacharacters, plus a small new feature which lets us say how many occurrences of a certain character we want to allow.

We have seen that * means zero or more, and that + means one or more, of the preceding character or character class.

In this case we will use dates as the target for our example. Consider some dates like the following.

30 April, 2008
14 May 2008
30. April 2008
1 - May - 2008 Description
2nd May 2008
Date: 12TH - 4 - 2008
02-05-2008

All valid dates to my eye, so we want to make a regular expression which assures us that we get such dates if we look for them. Let's outline some of the rules for dates. I will skip real validation of invalid dates like the 65th of February and merely consider the format.

Day comes first
Month comes second
Year comes last
Day is 1 or 2 digits long
Month is letters or a number which is 1 or 2 digits long
Year is 4 digits long
We have some kind of separators between day, month and year

Repeats - minimum and maximum
Those are the basic rules for the date formats we wish to accept and match. The first thing we need to assure is that the digit versions of day, month and year do not exceed their natural limitations of 1-2 digits, 1-2 digits and 4 digits. This can be done by setting minimum and maximum.

[0-9]{1,2}

The character class [0-9] must be repeated at least 1 and at most 2 times in sequence. So every number between 0 and 99 is good. The general form is {minimum,maximum}, applied to the preceding character or character class. There is also a shorter flavour of the {minimum,maximum} specifier. It is used in a case like this:

[0-9]{4}

Repeats - exact count
Using just {exact count} is useful when we want to make sure that the year in our dates is exactly 4 digits.

One other thing which could become useful for us is to find the boundaries of the dates we wish to match. What we want is to check from the starting boundary of the date to the ending boundary of the date. Let's consider a simple version which could match one of our dates.
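As a quick check of the two repeat forms, here is a minimal Python sketch. Python's re module is my own choice for illustration (the article itself works in EditPad Pro), and the helper name accepts is made up for this example. fullmatch requires the whole string to match, so it shows exactly which digit counts get through.

```python
import re

day_or_month = re.compile(r"[0-9]{1,2}")  # at least 1, at most 2 digits
year = re.compile(r"[0-9]{4}")            # exactly 4 digits


def accepts(pattern, text):
    """Return True when the whole text matches the pattern (hypothetical helper)."""
    return pattern.fullmatch(text) is not None


for text in ["7", "29", "123"]:
    print(text, accepts(day_or_month, text))

for text in ["2008", "208", "20089"]:
    print(text, accepts(year, text))
```

Only "7", "29" and "2008" should be accepted; the others have too few or too many digits.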

[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}

Boundaries
That one matches 02-05-2008 with success. But what if the date wasn't actually a date, but part of a calculation like 2002-05-20089124? Then we would still get a match on part of it, namely the 02-05-2008 in the middle. Which is not something we want. For this purpose regular expressions have a metacharacter called \b (boundary). \b does not match a character itself; it matches the position between what is considered a word character ( \w = [a-zA-Z0-9_] ) and anything else, including the start and end of the text. So if we rewrite the previous regular expression into:

\b[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}\b

Then we get the desired effect. We are saying that there has to be some kind of boundary first, then some digits and - (hyphens), and finally a boundary again. This will keep us from confusing 2002-05-20089124 with a date.
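To see the difference \b makes, here is a small Python sketch that runs both versions of the expression against the calculation string from the text. The helper name first_match is hypothetical, and Python is only my stand-in for whatever regex tool you use.

```python
import re

plain = re.compile(r"[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}")
bounded = re.compile(r"\b[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}\b")


def first_match(pattern, text):
    """Return the first matching substring, or None (hypothetical helper)."""
    match = pattern.search(text)
    return match.group(0) if match else None


calculation = "2002-05-20089124"
print(first_match(plain, calculation))     # the unwanted partial match
print(first_match(bounded, calculation))   # None: no boundary after the 4 digits
print(first_match(bounded, "on 02-05-2008."))
```

The plain version picks 02-05-2008 out of the middle of the number, while the bounded version only accepts the date when it stands on its own.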

Multiple choices
There is also the issue of the month being spelled out instead of written with digits. This means that we will have to adjust our month part a bit to allow either digits (as now, [0-9]{1,2}) or letters ( [a-zA-Z]+ ). The subexpression [a-zA-Z]+ says letters a-z and A-Z, with a minimum of 1 letter and no maximum; but only letters. So we adjust our regex a little again. This time we use the logical OR construction ( thispart | orthispart ). The expression will look like this:

\b[0-9]{1,2}-([a-zA-Z]+|[0-9]{1,2})-[0-9]{4}\b

Metacharacters for digits
Now all the 0-9 becomes a bit confusing with all the 1,2 numbers in there, so we re-write it using the \d (digit) metacharacter.

\b\d{1,2}-([a-zA-Z]+|\d{1,2})-\d{4}\b

Separators between day, month and year
We're getting there. But we still have a few issues left. Namely the separating characters. We have to accept , comma, . dot, - hyphen, space, tab etc. as separating characters. If you read the previous article you know that \s is space, tab, newline etc., so a character class like [\s,.-] will cover our possible separators. And since we need to have at least 1 separator between our day, month and year, we tweak the character class to [\s,.-]+ and we modify the expression to look like below.

\b\d{1,2}[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b

Extensions to day
It is almost complete; we can now have any number of separators between our day, month and year. So all we need is to allow days like 1st, 2nd, 3rd, 4th etc. We basically need to allow some text extensions to the day member of our expression. We could do this with a simple subexpression like (st|nd|rd|th). The trick is that the dates may contain the extensions or they may not, so we have to make the subexpression optional. So we add a ? to the subexpression and get (st|nd|rd|th)?. Our expression looks like this:

\b\d{1,2}(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b

So what is left to take care of? Basically the only thing left is that the dates MAY have some kind of separation between the day and the extension, such as 1 st or 2 nd. So let's inject our separator class [\s,.-] in between and make it optional, as in zero or more times. We do that with a *, so we get:

\b\d{1,2}[\s,.-]*(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b

Case-insensitive
What else? What if the day was written like 1 St? There are some case sensitivity issues. We could extend our subexpression (st|nd|rd|th)? into (st|nd|rd|th|St|Nd|Rd|Th)? or we could make the whole thing case-insensitive. That is done by (?i). Where in the expression (?i) may appear depends on the regex flavour; some only accept it at the very start. Now the expression looks like:

\b\d{1,2}[\s,.-]*(?i)(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b

And I believe that is as good as it gets today. Take time to study the regular expression. There are a few nitty gritty things one can do to make it really tight.
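To try the finished expression against the sample dates from the top of the article, here is a minimal Python sketch. It is an adaptation, not the exact expression above: recent Python versions reject a bare (?i) in the middle of a pattern, so I assume the scoped form (?i:...) around the day extension instead. The names DATE_RE, SAMPLES and find_date are made up for the example.

```python
import re

# Same structure as the article's final expression, except the day extension
# uses a scoped (?i:...) group (an assumption made for Python compatibility).
DATE_RE = re.compile(
    r"\b\d{1,2}[\s,.-]*(?i:st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b"
)

SAMPLES = [
    "30 April, 2008",
    "14 May 2008",
    "30. April 2008",
    "1 - May - 2008 Description",
    "2nd May 2008",
    "Date: 12TH - 4 - 2008",
    "02-05-2008",
]


def find_date(text):
    """Return the first date-like substring, or None when nothing matches."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


for line in SAMPLES + ["1 St May 2008", "2002-05-20089124"]:
    print(repr(line), "->", find_date(line))
```

Every sample date should be found, and the calculation string should give None. The expression still only checks the format, so something like 65 February 2008 would pass as well.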

Monday, April 28, 2008

Beginners RegEx - part 3 - metacharacters

In "Beginners RegEx - part 1" we saw some character classes (the [a-z] style) to match any character between a and z. We also saw that you can mix character classes to include more characters by using styles like [a-z0-9] to match every letter from a to z and every digit from 0 to 9. And if we wanted to allow the uppercase versions of the letters we'd have to write something like [a-zA-Z0-9].

Today we'll look at some metacharacters \char, which can make life easier when using character classes [ characters ].

Here is a short list of some character class metacharacters

\t = Tabulator
\n = Newline
\r = Carriage return
\s = Any white space character, ie. space, tab, newline, formfeed etc.
\S = Any character which is NOT a whitespace
\w = Word, which is basically [a-zA-Z0-9_]
\W = Any character which is NOT defined as being a word
\d = Digit, which is basically [0-9]
\D = Any character which is NOT defined as being a digit

So now we can construct some tighter regular expressions which eventually will be easier to read, once you grow accustomed to them.

So if we have a situation in a html file which looks like this:

Then we can construct a regular expression which matches the url part of the html by writing the following regex.

http://[\w\s/.-]+

This will give us a result like:

We see a new feature of regular expressions in the regex above. One we haven't discussed yet. It is the + (plus), which tells us to match one or more occurrences of the preceding character or character class. This is in contrast to the *, which means match zero or more occurrences of the preceding character or character class.

Let's take the regex apart and explain what is going on.

One of the things you may wonder about, hopefully, is that inside the character class we wrote . (dot) to allow matching of the dots between the url name parts in w4nd0rn.blogspot.com and the final . just before the type (.html).

Inside a character class [ ] the . (dot) does not mean any character whatsoever. In a character class it means just what it is, i.e. . (dot).
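Since the original screenshots are not reproduced here, the following Python sketch runs the url regex on a made-up piece of html. Only the host name w4nd0rn.blogspot.com comes from the article; the path and link text are assumptions for illustration.

```python
import re

URL_RE = re.compile(r"http://[\w\s/.-]+")

# Hypothetical html line; the path after the host name is invented.
html = '<a href="http://w4nd0rn.blogspot.com/2008/04/example.html">a link</a>'

match = URL_RE.search(html)
print(match.group(0))

# Because \s is part of the class, whitespace does not end the match.
text = "see http://w4nd0rn.blogspot.com/index.html for more"
print(URL_RE.search(text).group(0))
```

In the html line the match stops at the closing quote, because " is not in the character class. In the plain sentence the match runs on through the spaces to the end of the line, which is worth knowing before using this class on free text.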

Beginners RegEx - part 2 - negation and zero-width positive lookbehind

We're going to concentrate on negating ranges and a regex feature called zero-width positive lookbehind. Don't let the words scare you. It is harmless.

Negation is used to tell the regex that it should match everything which is NOT in a range []. Zero-width positive lookbehind is used to get the regex to match characters which will be used to find the stuff we are looking for, but to avoid getting the stuff as the result from our regex. The latter sounds really cryptic, but is quite useful.

Zero-width positive lookbehind. Have a look at the regex below

(?<=.H1.)[^<]*

The first part of the regex is the zero-width positive lookbehind (?<=text). The second is a range [characters] and the last is * which matches zero or more of the preceding range. Let's take it from the top.

(?<=text) tells that we are looking for text, but if we find a match then we do not want the text "printed" as part of the result. In this case we are looking for .H1. which, if we were searching an html file, would match <H1> (header size 1). Remember that . (dot) matches any character, so the dots around H1 match both the < before and the > after the H1. The only real restriction when using (?<=text) is that the text has to be of fixed length. So it is not possible to put ? and * inside a (?<=text), because these could change the length of the zero-width lookbehind subexpression.

[^<]* is a little tricky to understand. First off, it is a range we specify. So anything inside the [ and ] is what we are looking for. The * after the range tells us that whatever we find that matches the range can be repeated zero or more times. Inside the range we use 1 metacharacter, the ^.

^ as the very first character in a range means "anything but". So [^a] means that we want to match anything but the letter a. With the [^<] we look for anything but the character <, and as we have a * succeeding the range, [^<]* looks for zero or more occurrences of characters which are NOT <. Practically we get a match on an html string like this:

<H1>here is some header text</H1>

The first part of the full regex finds <H1>, but omits it from the result. The second part starts right after the <H1> and matches everything until it reaches a <. So the result is that we get the text inside the html tags <H1> and </H1>.

You can negate more than 1 character when using ranges. A regex like [^<=@] tells us that we want anything besides the characters < and = and finally @.
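Here is a small Python sketch of both ideas, using the header line from the text. Python is just my choice of tool for the demonstration; the names HEADER_TEXT and NOT_SPECIAL and the string "a<b=c@d" are made up.

```python
import re

# Lookbehind for <H1> (written with dots, as in the article), then
# everything up to the next <.
HEADER_TEXT = re.compile(r"(?<=.H1.)[^<]*")

html = "<H1>here is some header text</H1>"
header = HEADER_TEXT.search(html).group(0)
print(header)

# Negating several characters at once: anything except <, = and @.
NOT_SPECIAL = re.compile(r"[^<=@]+")
pieces = NOT_SPECIAL.findall("a<b=c@d")
print(pieces)
```

The first print shows only the text between the tags; the <H1> is required by the lookbehind but is not part of the result.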

Friday, April 25, 2008

Beginners RegEx

Here's a short article which could be called "RegEx for dummies" or beginners. The most important thing about all of this is to read the primer (below).

PRIMER: Regular Expressions (RegEx) are used to find text which matches certain criteria. RegEx does NOT find words or sentences. It finds single characters. So a regular expression like "gin" does not find the word "gin"; it finds the letter "g" followed by the letter "i" followed by the letter "n". Make a note of this because everything becomes much easier this way.

So here are a few examples, followed by an explanation of some of the metacharacters in RegEx.

RegEx: "gin" matches Beginners.

RegEx: "gin?" matches Begixxers. Because the n is optional (? = preceding character is optional)

RegEx: "(ht|f)tp://" matches both http:// and ftp:// (| = a logical or between the enclosed character sequences)

RegEx: "[1-2][0-9]" matches any number sequence where the 1st digit is 1 or 2 and the following digit is between 0 and 9. So numbers between 10-29. But it also matches inside 10000 or 294.51 or 14degrees.

RegEx: "/[a-z0-9]/" matches any sequence where a / is followed by exactly one letter from a-z or digit from 0 to 9 and then followed by a /

RegEx: "s.x" matches any sequence where an s is followed by ANY character and then followed by an x. So sex, six, sux, sax are all good matches.

RegEx: "http:.*\.zip" matches http: followed by any characters, repeated 0 or more times, and then finally .zip. The . means any character, the * means repetition 0 or more times and \. means the literal dot itself.

So here is a tiny explanation:

? = The preceding character is optional. Can be there or not
(x|y) = A sub group where either x or y is required. One could make them both optional by (x|y)?, see ?
[a-z0-9/] = A range for 1 character which can be any letter between a-z, any digit 0-9 or a /.
. = Means any character whatsoever.
* = Like ?, but it tells that the preceding character can be repeated 0 or an unlimited number of times.
\. = The literal dot itself. It is "escaped" by the \ to tell that we mean . and not the .(any char)
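If you would rather try the examples in code than in an editor, here is a minimal Python sketch. Python's re module is my own choice and not something the article uses; the helper name first and the example.com url are made up for illustration.

```python
import re


def first(pattern, text):
    """Return the first substring of text matching pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


examples = [
    ("gin", "Beginners"),
    ("gin?", "Begixxers"),
    ("(ht|f)tp://", "see ftp://host"),
    ("[1-2][0-9]", "14degrees"),
    ("[1-2][0-9]", "294.51"),
    ("/[a-z0-9]/", "a/b/c"),
    ("s.x", "six"),
    (r"http:.*\.zip", "http://example.com/file.zip"),
]

for pattern, text in examples:
    print(pattern, "on", text, "->", first(pattern, text))
```

Note how "gin?" finds just gi in Begixxers, and how "[1-2][0-9]" picks 29 out of 294.51. The regex finds characters, not words or numbers.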

That's all. Download EditPad Pro and load/write some text, use the search (Ctrl+F) and write your first regular expressions to try searching.

In later articles we shall construct more "game on" kinds of regular expressions, and explore tools like GREP, which has a slightly different and less feature-rich regex than EditPad Pro.

Thursday, April 24, 2008

Making the URL list

If you've bothered to watch the video tutorial on how to easily transform a URL list file into a ReGet queue with all necessary information, then you must face the job of creating a URL list file.

Now this can be done manually by copying the necessary information from the html files in your browser....OR....you may lean on regular expressions and GREP.

NOOOOOOOO!, I can hear you scream...too complex!

Yeeees!, I shout and tell you to watch the video tutorial on how to produce a URL list file by using regex. I will make some articles on the regular expression language. But for now.... watch the videoclip and decide for yourself if it is worth using.

Download videoclip here: urllist.avi.zip

Now, you may say "I could have done this with ReGet and IE integration or FireFox and FlashGot." and you're right. But when the contents span 25 pages or hundreds, using ReGet integration or FlashGot becomes a hassle. That is why I introduce this way of doing things, as it is easier if you have to produce URL lists for several pages. Using only 1 html file in the video is to keep the complexity to a minimum.

Practical TXT2WJR

Here you have a video tutorial with a practical example of how to use the tool. The video takes advantage of the newest version of TXT2WJR, which is v0.03. You can download it here.

Be advised that it is better if you download the video (XViD format) and view it as the image may seem blurred here on the blog.

Download video file for better view: http://www.mediafire.com/?zjzwsymflwe

Monday, April 21, 2008

Generating ReGet Queues

There is a new version available thanks to Flanker37's input. The new version is v0.02.

The new feature is a switch called /UP which is used like this:

TXT2WJR urls.txt reget.wjr /UP my@emailaddy.com mypassword

The reason is that if you specify a username with a @ (like an email address) in the username field of a regular HTACCESS based url then parsing it will go wrong in most cases. This is useful with sites like KarupsPC and others. If your target site does not require a username with a @ then feel free to stick it into the url list, otherwise pass it together with the /UP switch.

Download TXT2WJR v0.02

Sunday, April 20, 2008

From text list to download queue

I've done some programming. Mainly to make it easier to use ReGet Deluxe as download manager. The tool is called TXT2WJR because it converts a text file (txt) into a .wjr (ReGet Deluxe Queue) file.

Now, that may sound like kiddie stuff, and it almost is. Except that with the tool you can come from a plain text file looking like this:

hxxp://your_user:your_pass@members.bangbros.com/membercheck?path=ms4276/streaming&fname=ms4276500k.wmv
hxxp://your_user:your_pass@members.bangbros.com/membercheck?path=ms4242/streaming&fname=ms4242500k.wmv
hxxp://your_user:your_pass@members.bangbros.com/membercheck?path=ms4159/streaming&fname=ms4159500k.wmv
hxxp://your_user:your_pass@members.bangbros.com/membercheck?path=ms4121/streaming&fname=ms4121500k.wmv

To a ReGet download queue where the following is taken care of (on all entries):

The download folder for the files is the folder where you ran TXT2WJR.
You can custom set the filename from the URL if needed.
You can custom set a referer to every download file if needed.
You can custom set a cookie to the download queue entries if needed.

Another nice thing is that it is fairly easy to produce a plain text file holding download links by using GREP and some regex, possibly a column capable editor and maybe a few SED commands. It all depends on how you work.

I might enhance the tool slightly by adding the capability to custom set every single download path for every file in the queue. Possibly also a migration from commandline console to Windows GUI drag'n'drop (but do not bet on that though).

Download TXT2WJR v0.01

Thursday, April 17, 2008

Understanding the URL


Wednesday, April 16, 2008

How to siter!p - fooling around with a purpose

Now this is a subject which will be divided into several articles, mainly due to the fact that there is no 100% correct way of doing it and the diversity of sites is huge. So look at this as a learning curve where you pick whatever advice and tips you find useful for your needs.

Define your goal
First of all you need to define your goals for your siter!p (later: SR). What do you want to accomplish?

Getting all the pictures from a site?
Getting all pictures and videos?
Getting everything?
Getting just a partial rip?

If you're not doing an update (as they most often can be done quite easily and often manually) then pick your poison. Personally I aim for completeness.

This means get all media which is of value to you - pictures, videos and other things which do NOT change over time.

Size your opponent
Given you've chosen completeness then you must size your opponent. Is the site a solo model site or is it a larger multi model site where the contents could span 10+ dvd's? When sizing your opponent there are two things which seem to be de facto rules.

Multi model sites = cleaner html/cgi and structure = Autoleechable
Solo model site = can be dirty = Expect manual tasks

Collect intel
You need to collect intelligence about the site to find out what the best SR way is. Here is a list of things you should make note of while you click around and view the html page sources (because that will be required from you).

Is the site using plain HTML and files?
Is the site using CGI/PHP/ASP/JSP/Perl or similar to generate pages?
Is the site using redirection from the pages to the content you want?
Is the site a front cover for a larger contents server on another address?
Do individual picture/video pages link to other pages or are they a "dead end"?
Is the site using Adobe Flash to present contents?
Is the site using JavaScript and pop-up's to present contents?
Is the site providing zip files for contents?
Is contents available in multiple resolutions and qualities/formats?
Is the site based on FORM login or basic HTACCESS?
Does the site require you to re-login after a period of time?
What seem to be the limitations on simultaneous downloads?

1) If you can answer YES to bullet no. 1 then go further and copy/paste random links to pictures and videos. Put them into NotePad, UltraEdit or EditPad Pro. Copy 2-5 links to pictures from the most recent sets, 2-5 from sets which seem to have been produced at the beginning of the site and 2-5 from the time when the site was half as old. Go to your editor, look at the urls and determine: Is there a common structure? Is the name of the set the same as the directory part of the url? What tasks are needed to go from set name on the page to files on the site? The reason is to determine whether it is sufficient to just download the media or if you need the HTML also. Downloading the HTML will require you to do more cleaning up after the download, and it will also require more structure work when the download has finished.

2) If the answer to this is YES then expect some decent structure, because the computer generated pages come partly from a database on the server which holds set names, links and other pieces of information. Do the same as required in 1) but make a note of what parts of the urls change from media to media. To get the urls you must be looking at the pictures/videos, right-click, choose Properties and copy/paste the url of the media, and possibly also copy/paste the url in the address line of your browser. Copy/paste the url (copy link) which you clicked to arrive at the media. Look at them and determine what steps were needed to go from the front page to the media.

3) If redirection is used then primarily make a note of (copy/paste) the path of the media and compare it to the link the media came from. Especially note whether the content is on a deeper or a higher level than the page the link came from. Deeper and higher is determined by the number of / in the url of the media/page. A higher number of / means deeper. Also, find out what part of the link url was used to determine the specific redirection, if any special part seems to be the cause.

4) Some sites host all their real contents on other servers or on download servers. This has been the case for sites like Only Tease, Brazzers etc. This tends to change as the cost of hosting fluctuates. The download servers seldom require login. Instead they require either a proper cookie or a proper session url. You will have to find out which of the two it is, and you will also need to find out how long the session/cookie lasts before it needs to be refreshed. You will also need to find out if the download server requires a proper referer. The easiest way of finding this out is to try ReGet Deluxe on a link. Look at the HTTP properties in ReGet to see the cookie/session. Then copy the link to a new browser and see if it allows you to download. Remove the referer in ReGet and see if it still allows you to download. Most often you just need a proper cookie. Sometimes you just need the proper name of the encoded url for the contents server. Most often the cookie/session lasts 1-4 hours.

5) If every page with a picture/video has multiple links to the main page, set page and next picture/video etc. then find the urls which links further and copy/paste them so that you can filter them away during your download.

6) If Flash is used to present the contents then you're in for a 100% manual download - more or less. None of the current tools handle Flash sites very well. So get your patience and long hours ready.

7) If JavaScript and pop-up's are used then examine the code and the urls from link to media to see if you can avoid having to execute the JavaScript to download the media. So far I have only seen 1 site where it could not be circumvented. If you cannot avoid the JavaScript then you're in for a manual download.

8) If the site provides zip files for the contents then make sure that there ALWAYS is a zip file. Also make sure that it is the proper zip file. Most often the zip files are pre-produced (except for 1 site, to my knowledge) so you will have to probe a few zip files to see if they zip everything or if they are sloppy. Brazzers, as an example, sometimes zips thumbnails alongside the contents. If the zips match the individual files then go for the zips, otherwise omit the zips and go for the individual downloads.

9) Always aim for the highest quality. So if the pictures come in 2-3 resolutions then choose the highest resolution. If the videos come in the same quality and bitrate but multiple formats then pick the one with the lowest MB. That is most often WMV. MPEG/MOV tends to be larger. Download both the AVI and WMV version and watch the video to determine the better of the two.

10) FORM login is the login type you see at Hotmail, YouTube etc. These days the FORM login often comes with some OCR-style challenge in addition to the username and password. If FORM is used then be certain that a cookie/session is the key to accessing the site from your download tools. If HTACCESS is used then you should be able to go directly to the members page with an url like http://username:password@www.sitename.com/members/ or whatever the URL is. Now make a note that Internet Explorer does not allow that method for logging in (unless you tweaked it), so try it with FireFox or your download program.

11) Re-logins are used with FORM sites and they refresh/renew the cookie/session. So if you queue up a huge download which will take several hours then find the time limits and do not queue up more than you can manage to download within the time limitations. Why? Because otherwise the site admin will see you hammering the site all night and the chances of a ban are severely increased, which means you will be cut off.

12) Some sites have limitations on the number of simultaneous downloads. For Ann Angel, as an example, there is a limit of 2. This means that you should not exceed this number or you will be cut off by the site. Choose a number which fits a hardcore download guy using FireFox or some other tool. That means go no higher than 8 simultaneous downloads. Go down to 2-3 when you start your download manager and watch it progress to see if it could be increased. Do NOT roll over a site with 32 simultaneous downloads like a steam roller. The site admin or system will notice and you will get a ban. Nobody likes having their site hammered.

Probing
Now it is time to test whether your intel is sufficient. Unless it is a video site, try with the intel you have on the pictures first. This means that you will need to make a new project in Offline Explorer Enterprise Edition (OEEE) and put your urls, cookie=, referer=, username and passwords into OEEE and see whether it will download outside your Internet browser. Start with 1 set at the beginning. Set your URL filters as tight as possible according to your intel, both for server/domain, path and filename (to avoid downloading thumbnails). Limit your file types to pictures only. Hit Download and switch to the queue tab to see if things are resolved properly and that you in fact do download pictures. If NOT, then relax your restrictions until you get an acceptable result. Next try with 2 sets and see if it works. If it does then queue up the lot and let it run, still taking into consideration any time limits and simultaneous download limitations. Use F9 to pause your download when needed to watch the url resolving. And use # as a comment to play around with your urls.

cookie=user=myself&pass=secrets&sessionid=abch48d1gaup39Hl
referer=http://www.site.com/members/image?setname1
http://www.site.com/members/image?setname1

#http://www.site.com/members/image?setname2
#http://www.site.com/members/image?setname3
#http://www.site.com/members/image?setname4

It is a good idea to configure OEEE to either Download All Files or Do not download existing files, because while you are probing you may get some "trash" along with the download. Try to limit the amount of excess stuff you download. Clean your download folder after probing has ended and before you begin the real deal.

Finale
To get good download lists you may need to save some html pages and possibly use some regex or some column copy/paste. Macro downloading may also be of benefit to you. See the articles about MACRO download and regex.

ReGet Deluxe can import a text file with URL's so make one for the videos and import it. Mark all the url's in ReGet and set the properties for the download path, cookies, possible username/passwords etc. before you begin. Set max simultaneous downloads in ReGet to 8 (on huge videos).

The article series will continue with more specific information and tool practice alongside regex and download list creation. Until then - take care.

Thursday, April 10, 2008

How to find the invisible

When you are using regular expressions to grab text from files, especially from HTML files, you often run into a case where the easiest way to find the specific information you are looking for is to use the html tags as qualifiers.

As an example:

<STRONG>Here is the text I want to grab</STRONG>

In a case like above the only good qualifier is the HTML tags.

But you do not want a list like

<STRONG>Here is the first occurrence</STRONG>
<STRONG>This could be the second occurrence</STRONG>

Instead you want only the text inside the specific HTML tags. To accomplish that you should use an advanced regular expression feature called: Zero-width positive lookbehind.

With zero-width positive lookbehind you can use one part as qualifier and another to grab. Let's make a RegEx which handles the above.

(?<=<STRONG>)[a-zA-Z ]*

The magic part of the RegEx is the zero-width positive lookbehind, which specifies that it should look for the occurrence of <STRONG>. It is done by (?<=plain text). You can only have plain text inside the zero-width positive lookbehind part.

The second part of the RegEx is the matching of any number of characters between a-z and also A-Z plus ' ' (space). As the character < is NOT a part of the matching text then the RegEx will not match the </STRONG>.
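Here is a short Python sketch of that RegEx applied to the two example lines. Python is my own choice for the demonstration and the variable names are made up; the expression itself is the one from the article.

```python
import re

STRONG_TEXT = re.compile(r"(?<=<STRONG>)[a-zA-Z ]*")

html = (
    "<STRONG>Here is the first occurrence</STRONG>\n"
    "<STRONG>This could be the second occurrence</STRONG>\n"
)

# findall returns only the grabbed text; the <STRONG> qualifier is left out.
found = STRONG_TEXT.findall(html)
for text in found:
    print(text)
```

Keep in mind that [a-zA-Z ] only covers letters and spaces, so a digit or a comma inside the tags would cut the match short.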

I hope this is useful to you.

Tuesday, April 8, 2008

Tools of choice

EditPad Pro

One of the tools which may come in handy when you build your regular expressions is EditPad Pro. This tool has excellent support for RegEx and you can play around with your RegEx expressions interactively.

Screenshots: http://www.editpadpro.com/editpadproscreen.html

The tool is not that expensive and is an excellent addition to your software toolbox. EditPad Pro can do the same stuff as UltraEdit, although you may favour UltraEdit in some cases.

I'll not list all the features of EditPad Pro, except for the already mentioned excellent support for RegEx and the fact that the flavour of RegEx in the editor matches the flavour of RegEx for the GNU version of GREP.

So give it a try, and buy it if you find it useful - especially if you're a newbie at RegEx (like me).

Thursday, April 3, 2008

Exploring Lexi

Organizing and csv'ing.

So you've managed to download Lexi's picture sets and videos. Now let's organize them. Counting the image sets you will find 38. So let's at least number them from 01 (being on the last page) to 38 (being on the 1st page), as it seems like page 1 is where the set list grows.

We have the list with set names. We got those from the grep command in the previous article (below). Now we need the proper set name as written on the html page.

A small grep command like the one below does the job for us

grep -Poh "(?<=<CENTER><FONT face=verdana,arial,sans-serif,\* size=-2>)[a-zA-Z0-9&; ]*[^<]" p*.html

We copy the text and insert it in column mode next to the original text (from previous article) with the set names. We edit the text a bit and end up with something looking like the picture below.

Now do some normal Search Replace where you replace " (" with " (" and " )" with ")". Save the text file as rendirs.bat in the pictures folder of the site. Double-click the .bat file to run it and let it finish renaming the directories.

Now all you have to do is to pick the csv tool of your liking (ScanSort, The!Checker, Hunter, PhotoServe Check Ultimate, PhotoServe Professional Verify) and make the csv.

Going down on Lexi

Downloading a website.

One solo site named 'All over Lexi' uses php as the main cgi based url method. Now sometimes downloading cgi sites can give you a lot of unwanted stuff.

The site presents 4 pages with picture sets and 1 page with some videos. As there are way more pictures than videos, the pictures get precedence. Browse around and look at some images to see whether the webmaster is on drugs or clean. This site has a clean webmaster, which is good.

Looking at a picture and then rightclicking it to see the url will give you something like:

http://www.site.com/members/pictures/hallway_lg/lexi-lg-154.jpg

Check more pictures to see whether this layout sticks. With this site it does. All images are in the formats:

http://www.site.com/members/pictures/pinkoncouch_lg/lexi-lg-001.jpg
http://www.site.com/members/pictures/hallway_lg/lexi-lg-154.jpg

So only the set names and the numbering differ. First we have to get the set names. Go back to page 1 and save it to your hdd, then do the same with pages 2, 3 and 4. With those nicely saved, have a go at them with GREP. A regular expression like "(?<=href=\").*?_lg" does the job: for each link it returns everything from the start of the href up to and including _lg.
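As a rough Python illustration of the same regular expression, the sketch below pulls the set names out of a saved page and drops the _lg suffix. The href values in the sample are hypothetical (the real link layout of the site may differ), and the function name is made up.

```python
import re

# Same pattern as the grep: lookbehind for href=", then lazily up to _lg.
SET_LINK_RE = re.compile(r'(?<=href=").*?_lg')


def extract_set_names(html):
    """Return unique set names in page order, without the _lg suffix."""
    names = []
    for match in SET_LINK_RE.findall(html):
        name = match[:-len("_lg")]
        if name not in names:  # a set may be linked more than once
            names.append(name)
    return names


# Hypothetical markup; only the _lg naming is taken from the urls above.
sample = ('<a href="pinkoncouch_lg/index.php">1</a> '
          '<a href="hallway_lg/index.php">2</a> '
          '<a href="hallway_lg/index.php">again</a>')
set_names = extract_set_names(sample)
print(set_names)  # ['pinkoncouch', 'hallway']
```

Note that the lazy .*? does not stop at the closing quote, so a link without _lg followed by one with _lg on the same line would be swallowed into a single match. Check the output by eye.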

Mark it and copy it to Notepad, UltraEdit, EditPro or whatever editor you use, preferably one which handles column mode. Next we need to find out how many pictures to download from each set. Luckily the drug-free webmaster wrote this on the pages. Checking the html, we find that some newline characters get in the way of matching. Remove all tabs, newlines etc. from the 4 htm files and save them under new names. You can do that by running:

tr -d "\n\r\t" < p1.htm >p1.html
tr -d "\n\r\t" < p2.htm >p2.html
tr -d "\n\r\t" < p3.htm >p3.html
tr -d "\n\r\t" < p4.htm >p4.html
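If tr is not at hand, the same clean-up can be sketched in Python. The file names p1.htm to p4.htm follow the commands above; the function names are made up for the example, and the files are handled as raw bytes so the page encoding does not matter.

```python
from pathlib import Path


def strip_layout_bytes(data: bytes) -> bytes:
    """Delete newline, carriage return and tab bytes, like tr -d "\\n\\r\\t"."""
    return data.translate(None, b"\n\r\t")


def convert_pages(directory, count=4):
    """Read p1.htm..pN.htm in directory and write cleaned p1.html..pN.html."""
    for n in range(1, count + 1):
        src = Path(directory) / f"p{n}.htm"
        dst = src.with_suffix(".html")
        dst.write_bytes(strip_layout_bytes(src.read_bytes()))


print(strip_layout_bytes(b"<td>\r\n\t154 pictures</td>"))
```

Calling convert_pages(".") in the folder with the saved pages would mirror the four tr commands.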

The files with the .html extension are now free of newlines and tabs. On those 4 new files we run a GREP to extract the number of images to download:

grep -Poh "[0-9]+ pictures" p*.html
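The counting step can also be sketched in Python. This version requires at least one digit before " pictures", so a stray word "pictures" in the page text is not picked up. The sample markup is invented; 154 matches the hallway url shown earlier, the other number is hypothetical.

```python
import re

# One or more digits, a space, then the word the webmaster used.
COUNT_RE = re.compile(r"([0-9]+) pictures")


def extract_counts(html):
    """Return the picture counts as integers, in page order."""
    return [int(number) for number in COUNT_RE.findall(html)]


# Invented sample row layout.
sample = ("<td>Hallway</td><td>154 pictures</td>"
          "<td>Pink on couch</td><td>98 pictures</td>"
          "<p>More pictures soon</p>")
counts = extract_counts(sample)
print(counts)  # [154, 98]
```

The counts come out in the same order as the set names, as long as every set on the page states its count.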

This gives you a new list. Mark the numbers and paste them into your editor COLUMN WISE, next to the set names.

Now write the remaining fixed parts of the url in column mode to complete your list. Once done, search and replace ' ' (space) with nothing, which concatenates the columns into a proper list of url's with macros (see the article MACRO downloading).
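For comparison, here is a small Python sketch that expands set names and counts into explicit urls instead of using downloader macros. The url pattern is the one shown earlier in this article; numbering from 001 with three digits is an assumption based on those two example urls, and the counts in the example are made up to keep the output short.

```python
def build_urls(sets, base="http://www.site.com/members/pictures"):
    """Expand (set_name, picture_count) pairs into one url per picture."""
    urls = []
    for name, count in sets:
        # Assumes pictures are numbered 001, 002, ... without gaps.
        for i in range(1, count + 1):
            urls.append(f"{base}/{name}_lg/lexi-lg-{i:03d}.jpg")
    return urls


# Tiny hypothetical counts, just to show the result.
urls = build_urls([("pinkoncouch", 2), ("hallway", 3)])
for url in urls:
    print(url)
```

A macro list stays much shorter than this expanded list, which is why the column mode approach is still handy for big sets.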

Copy your download list to a new project in OEEE and insert your username and password so that each url will begin like http://yourusername:yourpassword@www.
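Adding the login to every url by hand gets tedious, so here is a hedged Python sketch of the same idea. The function name is made up; the username and password are the placeholders from the text above. Special characters in the login are percent-encoded, which matters if your password contains an @ or a colon.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(url, username, password):
    """Return url with username:password@ placed in front of the host."""
    parts = urlsplit(url)
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    # Assumes the url does not already carry a login part.
    return urlunsplit(parts._replace(netloc=f"{userinfo}@{parts.netloc}"))


example = with_credentials(
    "http://www.site.com/members/pictures/hallway_lg/lexi-lg-154.jpg",
    "yourusername",
    "yourpassword",
)
print(example)
```

This only works for sites using basic authentication, like the one in this article.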

No need for cookies or referers. Hit Download and let it run. Download the videos manually, there are only 5. Handle candids, wallpapers and webcams as you see fit.

HAM downloads

HAM downloading is a lazy way to obtain your download links, but it is rather effective. The basis of a HAM download is to grab the urls the site produces while the downloader is paused. Let's take Ann Angel as an example. Visit her site and log in, go to the pictures page (if that is what you are looking for), right-click anywhere and choose "Download with ReGet". I assume you have installed ReGet Deluxe and configured it to expert mode.

Once ReGet pops up, switch to the HTTP properties tab.

Since the site uses a FORM login, we cannot put the username and password directly in the url in OEEE. We have to obtain the cookie which holds the information needed to log in to the site. Mark the cookie line and copy it to your clipboard.
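A quick way to check that the whole cookie line was copied is to split it into its name and value pairs. The sketch below does that with Python's standard library. The cookie names and values are hypothetical; a real site will use its own.

```python
from http.cookies import SimpleCookie


def parse_cookie_line(line):
    """Split a copied cookie line into a {name: value} dictionary."""
    # Accept the line with or without a leading "Cookie:" label.
    if line.lower().startswith("cookie:"):
        line = line[len("cookie:"):]
    jar = SimpleCookie()
    jar.load(line.strip())
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical cookie line, only to show the format.
cookies = parse_cookie_line("Cookie: PHPSESSID=abc123; user=lexifan")
print(cookies)  # {'PHPSESSID': 'abc123', 'user': 'lexifan'}
```

If a pair is missing from the result, the line was probably cut off while copying.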

Now start OEEE and create a new project. Paste the cookie from your clipboard into the project. Make it look something like below.

Now change the maximum number of simultaneous downloads in OEEE to 1. You do that in the OEEE Options.

With that done, hit the Suspend button so that all downloads are paused. After that click the Queue tab in OEEE. Now hit the Download button in OEEE (everything is still paused).

Here comes the nice part: hit F9 twice. This turns off Suspend for a brief moment (allowing the link to be downloaded) and then puts OEEE back into Suspend mode right after. OEEE will then resolve the 1st download into all the links it finds at level 1. Now all you have to do is mark the lines you wish to download and choose Copy URL.

You can now paste the list back into OEEE (hit the Stop button first), or paste the url's into Notepad for some GREP processing or whatever you wish, for example to deduce url filters.
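One way to get a feel for useful url filters is to count how many of the copied links live in each directory. The Python sketch below does that. The host www.example.com and the paths are hypothetical, and grouping by directory is just one possible way to look at the list.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit


def directory_counts(urls):
    """Count unique urls per directory (scheme, host and path without file)."""
    counts = Counter()
    for url in dict.fromkeys(urls):  # drop duplicates, keep order
        parts = urlsplit(url)
        folder = posixpath.dirname(parts.path)
        counts[f"{parts.scheme}://{parts.netloc}{folder}/"] += 1
    return counts


# Hypothetical copied links.
copied = [
    "http://www.example.com/members/pictures/set1/001.jpg",
    "http://www.example.com/members/pictures/set1/002.jpg",
    "http://www.example.com/members/pictures/set1/002.jpg",
    "http://www.example.com/members/logout.php",
]
folders = directory_counts(copied)
for folder, count in folders.most_common():
    print(count, folder)
```

Directories with many links are good candidates to include in a filter, and lone links like a logout page are good candidates to exclude.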

