Scraping an Entire Website Using Linux
Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding certain full-fledged Web browsers, such as the Internet Explorer (IE) and the Mozilla Web browser. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. In contrast, Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Exemplary uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration. [wikipedia]
Scraping a website using the wget command in Linux:
$ wget -m --tries=7 "http://www.linuxnuggetz.blogspot.com"
Spoofing the user agent (the browser you appear to be using):
$ wget --user-agent="Mozilla 2.0" -m http://www.linuxnuggetz.blogspot.com
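The two commands above can be combined with a few more wget options for a friendlier mirror. A sketch (the delays, rate cap, and user-agent string here are just example values, not anything the post prescribes):

```shell
# Mirror the site (-m), also fetch page requisites such as images and CSS (-p),
# rewrite links so the copy browses offline (--convert-links), retry up to 7
# times, pause between requests and cap bandwidth to be polite to the server,
# and spoof the user agent, all in one invocation.
wget -m -p --convert-links --tries=7 \
     --wait=2 --limit-rate=100k \
     --user-agent="Mozilla/5.0" \
     "http://www.linuxnuggetz.blogspot.com"
```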
It's a nice post. I didn't know how to extract web data using Linux, as I'm basically from the .NET field, but I got good info from your post. Thanks.
Nice tutorial, I will try this on my Ubuntu box. If you're still active in blogging, I'd like to invite you to join our new blog community @ http://pinoygeeks.net
Where does the scraped website get stored?
The site is mirrored into a directory named after the host (here, www.linuxnuggetz.blogspot.com) under the folder where you launch the wget command, unless you specify another destination with the "-P" (--directory-prefix=PREFIX) option. Note that "-O" (--output-document=FILE) writes everything to a single file and does not combine well with recursive mirroring.
How do you install lynx and how well does it work with Ubuntu? I would like to gather property data for an auction coming up soon and this tool could help. Thanks
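On Ubuntu, lynx is available from the standard package repositories. A minimal sketch of installing it and dumping a page as plain text (the URL is just the blog from the post above):

```shell
# Install lynx from the Ubuntu/Debian package repositories
sudo apt-get install lynx
# Render the page and dump it as plain text; -nolist suppresses
# the numbered list of links lynx normally appends
lynx -dump -nolist "http://www.linuxnuggetz.blogspot.com" > page.txt
```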
How do I scrape only the text from a website? As I am interested only in text processing, I need to collect more text data.
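One rough way to get text alone is to mirror the pages with wget as shown above and then strip the HTML tags from the downloaded files. A minimal sketch using sed on a local sample file (the sample page below is just a stand-in for a file wget downloaded; for real pages, `lynx -dump page.html` renders HTML far more robustly than sed):

```shell
# Create a small sample page standing in for a downloaded file
cat > sample.html <<'EOF'
<html><body><h1>Linux Nuggetz</h1><p>Scraping text with wget.</p></body></html>
EOF
# Crude tag stripping: replace everything between < and > with a
# space, then squeeze runs of spaces down to one
sed -e 's/<[^>]*>/ /g' sample.html | tr -s ' ' > sample.txt
cat sample.txt
```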