Sirobot - a web fetch tool similar to wget
sirobot.pl [options] <URL> [[options] <URL>...]
Sirobot is a web fetch tool. It's implemented in Perl 5 and runs from a command line.
Sirobot takes URLs as arguments and downloads them; it can also recursively fetch the images and links referenced in those HTML files.
The main advantage over other tools like GNU wget is the ability to fetch several files concurrently which effectively speeds up your download.
Call Sirobot (the executable is called sirobot.pl) with at least one
URL (see URL) as an argument or specify a file to read URLs from
(option --file <file>, see OPTIONS).
If it can't find any URLs, a short usage message is displayed and
Sirobot quits.
There are various possibilities to influence Sirobot's behaviour such as how deep it should crawl into a WWW tree.
Sirobot tries to figure out which proxy to use. Therefore, it looks for the
environment variables $http_proxy and $ftp_proxy. You can always set
the proxy configuration manually (see --proxy and --ftpproxy).
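For instance, here is a sketch of both approaches, assuming a hypothetical proxy at proxy.example.com:8080 (replace it with your own) and assuming --proxy takes the proxy URL as its argument:

export http_proxy=http://proxy.example.com:8080/
export ftp_proxy=http://proxy.example.com:8080/
sirobot.pl http://www.sirlab.de/linux/

sirobot.pl --proxy http://proxy.example.com:8080/ http://www.sirlab.de/linux/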
Often used options may be put into ~/.sirobotrc. This file
is processed upon startup before any command line option is read.
This is done similarly to the --file command (see below), so the
syntax is the same as described there.
See also EXAMPLES for a rather useful example.
(If you are familiar with the usage of URLs you may skip this section)
A correct URL may look like this:
http://server/path/to/index.html   # Standard URL
http://server/file?query           # Standard URL with query
http://server/file#frag            # Standard URL with fragment
If you need to access a webserver at another port instead of the commonly used port 80 (default), try this (example accesses port 1234):
http://server:1234/
Some pages are protected by passwords. Sirobot can access these pages, too but it needs a username and password from you. The following example takes ``honestguy'' as username and ``secret'' as password:
http://honestguy:secret@server/
It works the same for FTP.
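For instance, an FTP URL with a username and password might look like this (the path is only an illustration):

ftp://honestguy:secret@server/pub/somefile.tar.gz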
Note: If you get a strange message about a missing method while using password authentication try updating your libwww-perl and/or URI libraries. See INSTALL for where to get them.
(See EXAMPLES for how to use them)
Sirobot's behaviour can be influenced in a lot of different ways to better fit your needs.
You can see a short summary of available options by simply running
sirobot.pl --help      (displays a summary of frequently used options)
sirobot.pl --morehelp  (displays a summary of ALL available options)
sirobot.pl --examples  (displays some examples of how to use Sirobot)
Please don't get confused by so many options, you surely
do not need them all :-)) If you don't know where to start, run
sirobot.pl --help and check out the commands displayed there.
Many arguments like --depth, --samedomain or --exclude remain
active for all remaining URLs unless another command overwrites them.
Some arguments take an additional value (eg. --depth takes a number).
Note: the following notations are all equivalent and are internally converted to the first form.
--depth 1
--depth=1
-d 1   (only available for short options)
-d1    (only available for short options)
See also --morehelp
See also --help
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
--nocurses to
your commandline.
See also --verbose, --silent and --quiet.
See also --stats.
See also --verbose, --debug and --silent.
See also --quiet, --verbose and --debug.
See also --nostats.
See also --quiet, --silent and --debug.
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
If curses cannot be used (eg. if stdout is not a tty), the ``old'' interface will be used.
See also --nocurses.
See also --curses.
Note: The following options are global and mutually exclusive which means only the last of the given options is active.
--tries (see there for limitations) except the fact that
--continue works even if the (incomplete) file was fetched with
another tool.
See also --force and --noclobber.
See also --continue, --newer and --noclobber.
See also --force, --newer and --continue.
--noclobber.
See also --force, --noclobber and --continue.
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
Note: These options also affect in which subdirectory the files are stored.
See also --samedir, --samedomain, --sameserver and --depth.
See also --sameserver, --samedomain, --anyserver and --depth.
See also --samedir, --sameserver, --anyserver and --depth.
See also --samedir, --samedomain, --anyserver and --depth.
Note: The following options can be mixed and each option may overwrite the preceding one partially or completely.
Everything Perl provides as regular expressions can be used for <regexp>; it is passed directly to the Perl match operator m/<regexp>/. Here are the main facts:
ba matches bad and alban but not bla.
h.llo matches hallo and hello.
xa*ba matches xaba, xaaba, xaaaaaba and even xba.
^ at the beginning denotes the start of a line
^here matches only if here appears at the beginning
of a line. Therefore it never matches there.
$ at the end denotes the end.
gif$ matches any file that ends on gif.
$, ., ^, brackets and some other characters must be escaped with a
backslash (\), eg. \$
See man perlre for even more stuff, and see EXAMPLES.
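Since each <regexp> ends up in a plain Perl match, you can test a pattern from the shell before handing it to Sirobot. This one-liner (filename and pattern are chosen only as an illustration) prints ``match'' if the pattern would apply to the file:

perl -e 'print "match\n" if "logo.gif" =~ m/gif$/'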
You may enter several --exclude and mix them with --include. If you want to allow only particular files, try this combination:
--exclude . --include <regexp>
which will disallow all files (a dot matches any string with at least one character) and re-allow files matching <regexp>.
The default can be restored by inserting ``--include .''. Note: when entered as a shell command, the regexp should be quoted: --include '.*'.
See also --include.
You may enter several --include and mix them with --exclude.
By default, all files are allowed.
See --exclude for more informations.
Note: Sirobot first reads the environment variables $http_proxy,
$ftp_proxy and $no_proxy to figure out your system's default settings.
Note: These settings are global for all URLs to fetch. Commandline options override environment settings.
See also --proxy and --noproxy.
See also --proxy and --ftpproxy.
See also --ftpproxy and --noproxy.
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
--anyserver, --sameserver,
--samedomain and --samedir affect which links are actually
converted and which are not, because they determine the folder in which
the files are stored.
See also --noconvert.
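As a minimal sketch, conversion could be combined with the scope options mentioned above like this (the URL is the same one used in the other examples; the exact combination is only an illustration):

sirobot.pl --convert --samedir --depth 1 http://www.sirlab.de/linux/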
--noconvert
Turns the conversion feature off (default).
See also --convert.
Note: The following options are mutually exclusive which means every option lasts until overwritten by another.
Note: Although it is possible to have multiple arguments per line, using one line per argument is strongly recommended.
All arguments read from the file are processed as if they had been entered on the command line. That means the same syntax applies, but remember that you must not escape special shell characters or use quotes. This also implies you can't have spaces as part of an argument, or empty arguments at all (really need that? Write me!)
See also EXAMPLES.
Turns the --remove feature off (default).
See also --remove.
--file). After the URL has been
downloaded successfully, it is deactivated in the file it came
from. --remove is useful to better keep track of which files
are already fetched and which are not.
Deactivation of a link is done by prepending a #[SIROBOT: done]
to the line that contains the link.
For this to work correctly, there must be only one link per line (and only the link; do not put options on the same line, put them on a separate line before the link).
This flag is intended to be used in combination with --continue
(which is not turned on by default) in order to continue large
downloads whenever you are online but it can be used without --continue,
too.
Note: As mentioned earlier, Sirobot can only detect if a file is complete if the server provides information about its content length.
See also --noremove, --file and EXAMPLES.
--curses. In that case,
everything printed to the upper part of the curses screen is also
written to the file.
If you have curses turned off (eg. by --nocurses), the output is the
same as on the screen.
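Assuming --log takes the name of the logfile as its argument (the description above only says that output is written to a file, so this exact syntax is an assumption), a call might look like:

sirobot.pl --log sirobot.log --nocurses http://www.sirlab.de/linux/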
See also --nolog.
See also --log.
--pipe) and does not exit if there are no more waiting jobs.
You can write any arguments to the file and Sirobot will process them like
those given by --file.
Note: The named pipe must be created before you run Sirobot
(eg. by the shell command mkfifo).
Note: Unfortunately, Sirobot blocks upon startup until at least one line is written to the pipe (eg. by echo >/tmp/sirobot). This is not Sirobot's fault.
See also --nodaemon, --pipe and EXAMPLES.
See also --daemon and --pipe.
See also --daemon and --nodaemon.
Depth 1 tells Sirobot to download all included images but no
further links.
Depth 2 does the same as Depth 1 PLUS it fetches all links on this
page PLUS all images of the links.
Depth 3 and beyond: I think you can guess it ;-)
To avoid downloading the whole internet, the use of --samedir,
--sameserver and --samedomain as well as --exclude and --include
is strongly recommended!
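For instance, a sketch of a depth-limited fetch that stays on the same server (the URL is the project homepage used in the other examples):

sirobot.pl --sameserver --depth 2 http://www.sirlab.de/linux/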
$USER and $HOSTNAME.
Please set your email address with this option in ~/.sirobotrc
as shown in EXAMPLES.
--header From=myname@home will be translated
into a ``From: myname@home''-line in the HTTP request header.
Useful for sites that require a correct Referer: header before they
allow downloads.
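For example, a Referer: header could be supplied like this (both URLs are only placeholders in this sketch):

sirobot.pl --header Referer=http://server/gallery.html http://server/images/photo.jpg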
This option is NOT RECOMMENDED! USE WITH CARE!
To be able to determine if the download was incomplete, Sirobot
needs some help from the server so this feature might not work
with all files! This also applies to --continue.
Any output written to STDOUT will be discarded unless the lines start with
ECHO. Error messages written to STDERR are currently not filtered and
therefore go directly to Sirobot's STDERR and may cause screen corruption
when Curses are turned on.
--dump), this feature will cause Sirobot to NOT follow
links and download them recursively but write all found links to a file
<file> (or STDOUT, if <file> is ``-''). Duplicate links are automatically
removed and dumped only once. Also, a ``-d<number>'' will be put in
front of each link to represent the current depth setting.
This can be used by an external program to filter incoming links or to
run Sirobot in some kind of dry or test mode. In conjunction with
option --daemon, the external program can feed the filtered links
back into Sirobot. Here's a simple and senseless loopback demonstration
(the named pipe /tmp/sirobot must exist):
sirobot.pl --dump --dumpfile /tmp/sirobot --pipe /tmp/sirobot \
--daemon -d2 http://www.sirlab.de/linux/
Please note the following drawbacks:
--anyserver.
--curses and --nocurses). You might also want to turn statistics
and other messages off, too: --quiet, --nostats.
Default value is 4096 bytes (4 KB). Use bigger values for fast links and use the default value or less for slow ones.
sirobot.pl http://www.sirlab.de/linux/
Get the Sirobot homepage (index.html) and its images and store them in the current directory.
sirobot.pl --prefix /tmp/fetched/ \
http://www.sirlab.de/linux/
Same as above but save all downloaded files to /tmp/fetched/
sirobot.pl --depth 0 http://www.sirlab.de/linux/
Get index.html only (depth 0).
sirobot.pl --anyserver --depth 2 http://www.tscc/searcher.html
Get all links mentioned on this page, wherever they're pointing to, with a maximum depth of two.
sirobot.pl --exclude '\.gif$' http://www.linux.org/
Get homepage of linux.org but don't download URLs that end with ``.gif''.
Get all pages recursively with a maximum depth of 2. Exclude all files and re-allow those that end with ``.html''. That effectively means only HTML files get fetched, but no images or other stuff.
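A command along these lines would do it (a sketch only; the URL is just an illustration):

sirobot.pl --depth 2 --exclude '.' --include '\.html$' http://www.sirlab.de/linux/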
sirobot.pl --file getthis.txt
Read getthis.txt and process its content as command line arguments. Imagine getthis.txt consists of the following lines:
### start of getthis.txt ###
--depth 0
http://xy.org/
--prefix zzz
http://zzz.net/
### end of getthis.txt ###
which is the same as if you invoke
sirobot.pl --depth 0 http://xy.org/ --prefix zzz http://zzz.net/
sirobot.pl --remove --continue --file getthis.txt
This is nearly the same as above, with one major difference: After http://xy.org/ and http://zzz.net/ are successfully downloaded, getthis.txt reads like this:
### start of getthis.txt ###
--depth 0
#[SIROBOT: done] http://xy.org/
--prefix zzz
#[SIROBOT: done] http://zzz.net/
### end of getthis.txt ###
What's that good for, you ask? Well, imagine your connection is terminated before the files are completely fetched (eg. because you've hung up your modem, the link broke down etc). Then you can issue exactly the same line when you're back online again. You don't need to keep track of which files are complete and which are not.
You may create a file ~/.sirobotrc which will be processed upon startup. It usually contains your preferred settings so you don't need to type them every time.
Here's what I have put into my ~/.sirobotrc:
### start of ~/.sirobotrc ###
# Put your email address here:
--from yourusername@somedomain
# Exclude all nasty big files that might accidentally be fetched
# during recursions. They still may be re-enabled if needed.
--exclude \.(gz|bz2|tar|tgz|zip|lzh|lha)(\?.*)?$
--exclude \.(mpg|mp3|wav|aif|au)(\?.*)?$
--exclude \.(ps|pdf)(\?.*)?$
### end of ~/.sirobotrc ###
mkfifo /tmp/sirobot
sirobot.pl --daemon &
echo >/tmp/sirobot
This creates the named pipe /tmp/sirobot (aka fifo) and puts Sirobot in daemon mode. Sirobot will block until you write something to the named pipe, that's what the last line is good for.
Now you can send Sirobot additional commands if you write to the pipe:
echo --depth 0 >/tmp/sirobot
echo http://slashdot.org >/tmp/sirobot
echo --prefix fm/ http://freshmeat.net >/tmp/sirobot
End daemon mode by writing --nodaemon to the pipe:
echo --nodaemon >/tmp/sirobot
Remember that the following options affect only URLs issued after them:
--anyserver, --samedomain, --samedir, --sameserver, --depth,
--prefix, --exclude, --include and --tries.
This means, you can get URL1 with depth 2 and URL2 with depth 1 and save them to different directories with one single call of Sirobot if you try the combination ``--prefix dir1/ --depth 2 URL1 --prefix dir2/ --depth 1 URL2''.
sirobot.pl --anyserver -d 2 http://slashdot.org/ \
--samedir http://freshmeat.net/
Get all links from Slashdot (depth 2) and those links from freshmeat.net that point to the same directory (depth 2, too!).
You still didn't get it? Let me know! See CONTACT for how to contact the author.
This piece of software comes with absolutely no warranty. The author cannot be made responsible for any failures, defects or other damages caused by this program. Use it at your own risk.
Sirobot is licensed under the GPL.
Problems? Found a bug? Want new features?
Feel free to contact the author for any kind of reason except SPAM:
Email: Settel <settel@sirlab.de>
WWW: http://www.sirlab.de/linux/contact.html
IRC: Settel, usually on #unika and #linuxger
See the following page for updates, changelogs etc:
http://www.sirlab.de/linux/