Tuesday, April 8, 2008

Heritrix 1: Installing and running Heritrix

I am, generally speaking, new to Linux and its administration, and I am working on focused crawling with the open source crawler Heritrix. Getting stuck is common, so I am writing this down in case it helps someone like me.

. Download Heritrix
. Install Heritrix
. Install Sun Java (apt-get install sun-java5-jdk)
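. Choose the Java we installed (update-java-alternatives -l lists the installed ones; update-java-alternatives -s NAME selects one, e.g. java-1.5.0-sun to match the JAVA_HOME used below)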
. Make sure that port 8080 is free (netstat -nltp)
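On my machine 8080 was already in use by a java process, so I stopped it (killall java). Be careful: this kills every running java process, not only the one holding the port.
. Then follow the steps from the user manual. I made a startup.sh containing: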
export HERITRIX_HOME=/home/tyampoo/Desktop/heritrix-1.14.0-RC1
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/jre
export JAVA_OPTS="-Xmx512M"
$HERITRIX_HOME/bin/heritrix --admin=admin:abc123
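
The port check from the steps above can also be scripted, so that startup.sh could refuse to start when 8080 is taken. What follows is only a rough sketch, not something from the Heritrix manual: port_in_use is a helper name I made up, and it assumes the usual Linux netstat -nlt column layout, where the fourth column is the local address ending in :PORT.

```shell
#!/bin/sh
# Hypothetical helper (not part of Heritrix): reads `netstat -nlt`-style
# output on stdin and succeeds (exit 0) if a TCP socket is bound to the
# port given as the first argument.
port_in_use() {
    awk -v port="$1" '
        $1 ~ /^tcp/ {
            # Local address is column 4, e.g. 0.0.0.0:8080 or :::8080;
            # the port is whatever follows the last colon.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Intended use near the top of startup.sh (left commented out here):
# if netstat -nlt | port_in_use 8080; then
#     echo "Port 8080 is busy, not starting Heritrix" >&2
#     exit 1
# fi
```

Running netstat -nltp as root also shows the PID/program name in the last column, which helps to stop just the one process holding the port instead of all java processes.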

Now I can see the web interface (http://127.0.0.1:8080/index.jsp). I believe, as parse says, this is just a spoon, and now I have to dig this enormous hill with it!

Tyampoo.
