Now this is a subject which will be divided into several articles, mainly because there is no 100% correct way of doing it and the diversity of sites is huge. So look at this as a learning curve where you pick whatever advice and tips you find useful for your needs.
Define your goal
First of all you need to define your goals for your site rip (later: SR). What do you want to accomplish?
- Getting all the pictures from a site?
- Getting all pictures and videos?
- Getting everything?
- Getting just a partial rip?
This means getting all the media which is of value to you - pictures, videos and other things which do NOT change over time.
Size your opponent
Given that you've chosen completeness, you must size your opponent. Is the site a solo model site, or is it a larger multi-model site where the contents could span 10+ DVDs? When sizing your opponent, two things seem to be de facto rules:
- Multi-model sites = cleaner HTML/CGI and structure = autoleechable
- Solo model site = can be dirty = expect manual tasks
Collect intel
You need to collect intelligence about the site to find out what the best SR approach is. Here is a list of things you should make note of while you click around and view the HTML page sources (because that will be required of you).
- Is the site using plain HTML and files?
- Is the site using CGI/PHP/ASP/JSP/Perl or similar to generate pages?
- Is the site using redirection from the pages to the content you want?
- Is the site a front cover for a larger contents server on another address?
- Do individual picture/video pages link to other pages, or are they a "dead end"?
- Is the site using Adobe Flash to present contents?
- Is the site using JavaScript and pop-ups to present contents?
- Is the site providing zip files for contents?
- Are the contents available in multiple resolutions and qualities/formats?
- Is the site based on FORM login or basic HTACCESS?
- Does the site require you to re-login after a period of time?
- What seem to be the limitations on simultaneous downloads?
1) If the answer to this is YES, then copy some links to pictures and videos. Put them into NotePad, UltraEdit or EditPad Pro. Copy 2-5 links to pictures from the most recent sets, 2-5 from sets which seem to have been produced at the beginning of the site, and 2-5 from the time when the site was half-old. Go to your editor, look at the URLs and determine: Is there a common structure? Was the name of the set the same as the directory part of the URL? What steps are needed to go from set name on the page to files on the site? The reason is to determine whether it is sufficient to just download the media or if you need the HTML as well. Downloading the HTML will require you to do more cleaning up after the download, and it will also require you to do more structure work when the download has finished.
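If you want to automate the comparison, a few lines of Python will line the URLs up for you. This is just a sketch - the URLs below are made up for illustration:

from urllib.parse import urlparse

# Hypothetical picture URLs copied from an old, a middle-aged and a recent set
urls = [
    "http://www.site.com/members/sets/2003-beach/pic001.jpg",
    "http://www.site.com/members/sets/2005-studio/pic014.jpg",
    "http://www.site.com/members/sets/2007-garden/pic027.jpg",
]

# Split each path into segments so the fixed and the changing parts line up
for u in urls:
    print(urlparse(u).path.split("/"))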
2) If the answer to this is YES, then expect some decent structure, because the computer-generated pages come partly from a database on the server which holds set names, links and other pieces of information. Do the same as required in 1), but make note of which parts of the URLs change from media to media. To get the URLs you must be looking at the pictures/videos, right-clicking, choosing Properties and copy/pasting the URL of the media, and possibly also copy/pasting the URL in the address line of your browser. Copy/paste the URL (copy link) which you clicked to arrive at the media. Look at them and determine what steps were needed to go from the frontpage to the media.
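A small Python sketch can also show you exactly which parts of two copied URLs differ (the URLs here are invented):

from urllib.parse import urlparse, parse_qs

# Two made-up script-generated URLs copied from two different pictures
a = urlparse("http://www.site.com/members/image.php?set=42&pic=7")
b = urlparse("http://www.site.com/members/image.php?set=42&pic=8")

print("same path:", a.path == b.path)

# Compare the query parameters one by one to see which of them vary
qa, qb = parse_qs(a.query), parse_qs(b.query)
for key in sorted(set(qa) | set(qb)):
    print(key, qa.get(key), qb.get(key))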
3) If redirection is used, then primarily make a note of (copy/paste) the path of the media and compare it to the link the media came from. Especially make a note of whether the content is on a deeper level than the page the link came from, or on a higher one. Deeper and higher are determined by the number of / in the URL of the media/page: a higher number of / means deeper. Also, find out what part of the link URL was used to determine the specific redirection - if any special part seems to be the cause.
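Counting the slashes is easy to script while you compare media URLs to the pages they came from. A minimal sketch with placeholder URLs:

from urllib.parse import urlparse

page  = "http://www.site.com/members/sets/beach/index.html"       # page the link sat on
media = "http://www.site.com/members/sets/beach/full/pic001.jpg"  # where it redirected to

def depth(url):
    # Number of / in the path part - more slashes means deeper
    return urlparse(url).path.count("/")

print("page depth:", depth(page), " media depth:", depth(media))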
4) Some sites host all their real contents on other servers or on download servers. This has been the case for sites like Only Tease, Brazzers etc. This tends to change as the cost of hosting fluctuates. The download servers seldom require login. Instead they either require a proper cookie or a proper session URL. You will have to find out which of the two it is; further, you will need to find out how long the session/cookie lasts before it needs to be refreshed. You will also need to find out if the download server requires a proper referer. The easiest way of finding out is to try ReGet Deluxe on a link. Look at the HTTP properties in ReGet to see the cookie/session, and copy the link to a new browser to see if it allows you to download. Remove the referer in ReGet and see if it still allows you to download. Most often you just need a proper cookie. Sometimes you just need the proper name of the encoded URL for the contents server. Most often the cookie/session lasts 1-4 hours.
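If you would rather script this probe than click around in ReGet, the same checks can be done with Python's requests library: try the link with and without the cookie and the referer and compare the results. The URL, cookie value and referer below are placeholders:

import requests

url = "http://dl.contentserver.com/sets/beach/pic001.jpg"   # made-up download-server link
cookies = {"sessionid": "abch48d1gaup39Hl"}                  # value lifted from the browser/ReGet
referer = {"Referer": "http://www.site.com/members/"}

for label, kwargs in [
    ("cookie + referer", dict(cookies=cookies, headers=referer)),
    ("cookie only",      dict(cookies=cookies)),
    ("referer only",     dict(headers=referer)),
    ("nothing",          dict()),
]:
    r = requests.head(url, allow_redirects=True, **kwargs)
    print(label, "->", r.status_code)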
5) If every page with a picture/video has multiple links to the main page, set page, next picture/video etc., then find the URLs which link further and copy/paste them so that you can filter them away during your download.
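A simple keyword filter is enough to drop those navigation links from a list afterwards. A sketch with made-up patterns:

# Substrings that identify navigation links rather than media - adjust to your intel
unwanted = ["index.html", "main.php", "next.php", "prev.php"]

candidates = [
    "http://www.site.com/members/sets/beach/pic001.jpg",
    "http://www.site.com/members/sets/beach/next.php?pic=2",
]

media_only = [u for u in candidates if not any(bad in u for bad in unwanted)]
print(media_only)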
6) If Flash is used to present the contents then you're in for a 100% manual download - more or less. None of the current tools handle Flash sites very well. So get your patience and long hours ready.
7) If JavaScript and pop-ups are used, then examine the code and the URLs from link to media to see if you can avoid having to execute the JavaScript to download the media. So far I have only seen 1 site that could not be circumvented. If you cannot avoid the JavaScript then you're in for a manual download.
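Very often the media URL is sitting right inside the javascript: link or the onclick handler, so a regex over the saved page source can pull it out without ever running the script. A rough sketch, assuming the URL is passed as a quoted argument:

import re

# Made-up snippet of page source where the media sits inside a javascript: pop-up link
html = """<a href="javascript:popup('/members/sets/beach/pic001.jpg',800,600)">pic 1</a>"""

# Grab any quoted path that ends in a known media extension
for hit in re.findall(r"['\"]([^'\"]+\.(?:jpg|jpeg|png|wmv|avi|mpg))['\"]", html, re.I):
    print(hit)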
8) If the site provides zip files for the contents, then make sure that there ALWAYS is a zip file. Also make sure that it is the proper zip file. Most often the zip files are pre-produced (except for 1 site - to my knowledge), so you will have to probe a few zip files to see if they zip everything or if they are sloppy. Brazzers sometimes zips thumbnails alongside the contents, as an example. If the zips match the individual files then go for the zips, otherwise omit the zips and go for the individual downloads.
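Probing a zip is as simple as listing its contents and checking whether thumbnails or other junk got packed alongside the full-size files. A sketch, assuming you have already downloaded one set as set.zip:

import zipfile

# A set zip you downloaded while probing (made-up file name)
with zipfile.ZipFile("set.zip") as z:
    for info in z.infolist():
        print(info.filename, info.file_size)
    # Lots of tiny files usually means thumbnails were zipped alongside the contents
    small = [i.filename for i in z.infolist() if i.file_size < 50000]
    print("suspiciously small files:", small)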
9) Always aim for the highest quality. So if the pictures come in 2-3 resolutions, then choose the highest resolution. If the videos come in the same quality and bitrate but multiple formats, then pick the one with the lowest MB. That is most often WMV; MPEG/MOV tends to be larger. If in doubt, download both the AVI and WMV version and watch the video to determine the better of the two.
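When the same clip comes in several formats, a HEAD request per variant lets you compare sizes before committing to full downloads. A sketch with placeholder links:

import requests

# Made-up links to the same clip in different formats
variants = {
    "wmv": "http://www.site.com/members/videos/set1/clip.wmv",
    "avi": "http://www.site.com/members/videos/set1/clip.avi",
    "mpg": "http://www.site.com/members/videos/set1/clip.mpg",
}

for fmt, url in variants.items():
    size = requests.head(url, allow_redirects=True).headers.get("Content-Length")
    print(fmt, size, "bytes")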
10) FORM login is the login type you see at Hotmail, YouTube etc. These days the FORM login often comes with an image verification code (which you would need OCR to automate) in addition to the username and password. If FORM is used, then be certain that a cookie/session is the key to accessing the site from your download tools. If HTACCESS is used, then you should be able to go directly to the members page with a URL like http://username:password@www.sitename.com/members/ or whatever the URL is. Note that Internet Explorer does not allow that method for logging in (unless you have tweaked it), so try it with Firefox or your download program.
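The HTACCESS check can also be scripted: Python's requests library takes the username and password as basic auth instead of embedding them in the URL. Site name and credentials below are of course placeholders:

import requests

# If this returns 200 the members area is plain HTACCESS (basic auth), and most
# download tools will take it with just the username/password fields filled in.
r = requests.get("http://www.sitename.com/members/", auth=("username", "password"))
print(r.status_code)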
11) Re-login is used with FORM sites and it refreshes/renews the cookie/session. So if you queue up a huge download which will take several hours, then find the time limits and do not queue up more than you can manage to download within those limits. Why? Because otherwise the site admin will see you hammering the site all night, and the chances of a ban are severely increased, which means you will be cut off.
12) Some sites have limitations on the number of simultaneous downloads. For Ann Angel, as an example, there is a limit of 2. This means that you should not exceed this number or you will be cut off by the site. Choose a number which fits a hardcore download guy using Firefox or some other tool; that means go no higher than 8 simultaneous downloads. Go down to 2-3 when you start your download manager and watch it progress to see if it can be increased. Do NOT roll over a site with 32 simultaneous downloads like a steamroller. The site admin or system will notice and you will get a ban. Nobody likes to see their site hammered.
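If you ever script part of a download yourself, enforce the limit in code so you cannot hammer the site by accident. A minimal sketch that caps a thread pool at two workers:

import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs - in practice this would be your probed download list
urls = [
    "http://www.site.com/members/sets/beach/pic001.jpg",
    "http://www.site.com/members/sets/beach/pic002.jpg",
]

def fetch(url):
    # Stream to disk so large files do not sit in memory
    with requests.get(url, stream=True) as r:
        with open(url.rsplit("/", 1)[-1], "wb") as f:
            for chunk in r.iter_content(chunk_size=65536):
                f.write(chunk)

# max_workers is the simultaneous-download limit - 2 matches the Ann Angel example
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(fetch, urls))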
Probing
Now it is time to test whether your intel is sufficient. Unless it is a video site, try with the intel you have on the pictures first. This means that you will need to make a new project in Offline Explorer Enterprise Edition (OEEE), put your URLs, cookie=, referer=, username and passwords into OEEE and see whether it will download outside your Internet browser. Start with 1 set at the beginning. Set your URL filters as tight as possible according to your intel, both for server/domain, path and filename (to avoid downloading thumbnails). Limit your file types to pictures only. Hit Download and switch to the queue tab to see if things are resolved properly and that you in fact do download pictures. If NOT, then relax your restrictions until you get an acceptable result. Next try with 2 sets and see if it works. If it does, then queue up the lot and let it run, still taking into consideration any time limits and simultaneous-download limitations. Use F9 to pause your download when needed to watch the URL resolving. And use # as a comment to play around with your URLs, as in the example below.
cookie=user=myself&pass=secrets&sessionid=abch48d1gaup39Hl
referer=http://www.site.com/members/image?setname1
http://www.site.com/members/image?setname1
#http://www.site.com/members/image?setname2
#http://www.site.com/members/image?setname3
#http://www.site.com/members/image?setname4
It is a good idea to configure OEEE to either Download All Files or Do not download existing files, because while you are probing you may get some "trash" along with the download. Try to limit the amount of excessive stuff you download. Clean your download folder after probing has ended and before you begin the real deal.
Finale
To get good download lists you may need to save some HTML pages and possibly use some regex or some column copy/paste. Macro downloading may also be of benefit to you. See the articles about MACRO download and regex.
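As a small taste of the regex approach, here is a sketch that pulls media links out of saved HTML pages and writes them to a plain text list. The folder name, extensions and output file are assumptions you would adjust to your own intel:

import re
import glob

# Pull media links out of the saved members pages
pattern = re.compile(r'href="([^"]+\.(?:wmv|avi|mpg|zip))"', re.I)

links = []
for page in glob.glob("saved_pages/*.html"):
    with open(page, encoding="utf-8", errors="ignore") as f:
        links.extend(pattern.findall(f.read()))

# One URL per line - a plain list like this should be importable by ReGet or OEEE
with open("videolist.txt", "w") as out:
    out.write("\n".join(sorted(set(links))))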
ReGet Deluxe can import a text file with URLs, so make one for the videos and import it. Mark all the URLs in ReGet and set the properties for the download path, cookies, possible username/passwords etc. before you begin. Set max simultaneous downloads in ReGet to 8 (on huge videos).
The article series will continue with more specific information and tool practice alongside regex and download list creation. Until then - take care.