[F#][SNIPPET] Basic web crawler example (multi-core)

Status
Not open for further replies.

Hyperz

Code:
open System
open System.Net
open Hyperz.SharpLeech.Engine
open Hyperz.SharpLeech.Engine.Html
open Hyperz.SharpLeech.Engine.Net

(* Download a page; on error just return whatever data came back,
   otherwise follow redirects and return the final page *)
let getData url =
    url
    |> Http.Prepare
    |> Http.Request
    |> fun result ->
        if result.HasError then result.Data
        else Http.HandleRedirects(result, false).Data

(* Extract every link from a page and normalize it to an absolute URL *)
let getUrls html sourceUrl =
    let baseUrl =
        let u = Uri(sourceUrl)
        u.Scheme + "://" + u.Host
    let doc = HtmlDocument()
    doc.LoadHtml(html)
    doc.DocumentNode.SelectNodes("//a")
    |> Seq.map (fun node -> node.GetAttributeValue("href", ""))
    |> Seq.map (fun url -> HttpUtility.HtmlDecode(url).Trim())
    |> Seq.map (fun url ->
        if url.StartsWith("http://") then url
        elif url.StartsWith("https://") then url
        elif url.StartsWith("/") then baseUrl + url
        elif url.StartsWith("#") then "" (* drop in-page anchors *)
        else baseUrl + "/" + url)
    |> Seq.filter (fun url -> url.Length > 0)

(* Recursively crawl, threading the list of visited URLs through the
   recursion instead of keeping mutable shared state *)
let rec crawl url crawled = Async.Start(async {
    let data = getData url
    let urls =
        getUrls data url
        |> Seq.filter (fun u -> not (List.exists ((=) u) crawled))
        |> List.ofSeq (* materialize once; avoids enumerating the lazy seq twice *)
    do printfn "Crawling: %s\nFound:    %i URLs" url (List.length urls)
    for u in urls do crawl u (crawled @ [u]) })


(* ================================================ *)
(* START CRAWLING                                   *)
(* ================================================ *)

let url = "http://thepiratebay.org/"
let rec memCleaner() =
    (* Clean memory every 10 seconds *)
    System.Threading.Thread.Sleep(10000)
    GC.Collect()
    memCleaner()

ServicePointManager.DefaultConnectionLimit <- 10
Http.MaxRedirects <- 2
Http.Timeout <- 10000
Http.KeepAlive <- true
Http.UseCompression <- true
Console.BufferWidth <- 256
Console.BufferHeight <- 768
Console.Title <- "F# Web Crawler"

(* Start the crawler and mem cleaner *)
Async.Start(async{memCleaner()})
crawl url [url]

stdin.Read() |> ignore

This is just a little something I made while working on SL to test the performance of the Http class. It crawls a web page, extracts the links, filters out the dupes (already-crawled ones) and adds the new links to a queue. It's based on recursion and is "stateless" (no mutable variables are used to store data or state), which means that no matter how many CPU cores you have, it'll scale linearly across all of them with very little performance loss. It doesn't have any practical function, but it might be a handy reference for people looking into functional programming or F#.
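For anyone reading along without the SharpLeech library, the core idea (immutable state threaded through the recursion, per-page work fanned out asynchronously) can be sketched on its own. Everything below is a hypothetical stand-in, not the SL implementation: fetchLinks fakes the download/extract step, and an immutable Set replaces the list for the visited check.

Code:
```fsharp
open System

(* Hypothetical stand-in for getData + getUrls: maps a URL to the
   links "found" on that page. A real version would do an HTTP fetch. *)
let fetchLinks (url: string) =
    match url with
    | "http://example.org/"  -> [ "http://example.org/a"; "http://example.org/b" ]
    | "http://example.org/a" -> [ "http://example.org/b"; "http://example.org/c" ]
    | _ -> []

(* Crawl breadth-first, rebuilding an immutable Set of visited URLs
   each round instead of mutating shared state. *)
let rec crawl (frontier: string list) (visited: Set<string>) =
    match frontier with
    | [] -> visited
    | _ ->
        let found =
            frontier
            |> List.map (fun u -> async { return fetchLinks u })
            |> Async.Parallel            (* fetch the whole frontier concurrently *)
            |> Async.RunSynchronously
            |> Seq.concat
            |> Seq.filter (fun u -> not (Set.contains u visited))
            |> Seq.distinct
            |> List.ofSeq
        crawl found (visited + Set.ofList found)

let visited = crawl [ "http://example.org/" ] (Set.singleton "http://example.org/")
printfn "Visited %d pages" visited.Count
```

Because visited is an immutable Set rebuilt each round rather than a shared mutable collection, the parallel fetches never contend over state, which is the property described above.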

Video showing a slightly modified version:
[youtube]c41mmbuXBZQ[/youtube]
Ignore the heavy hiphop tune and other nonsense. I was a bit drunk when I was recording it, lol :(.
 
8 comments
lool! we were talking on teamspeak about this. Personally, and i dont care what you say Hyp, my method pwns urs! Let the battle begin! >_> i'll post some speed-tests using my little engine.

EDIT: chomps ur CPU a little - why so? What part needs so much calculation?
 
:/ makes me wonder how fast the googlebot system is..

Great tut Hyperz, any chance of a C# version for a comparison of speeds?

A C# version would have at least 10 times more code to do the same thing.

Quote:
lool! we were talking on teamspeak about this. Personally, and i dont care what you say Hyp, my method pwns urs! Let the battle begin! >_> i'll post some speed-tests using my little engine.

EDIT: chomps ur CPU a little - why so? What part needs so much calculation?

This doesn't show my method >_>. But sure, bring it on. It's time to kick ass and chew bubblegum!

It has a queue of 500,000 URLs. Each time it crawls a page it extracts 50 URLs on average. Each of those 50 URLs has to be checked against the entire queue to filter out dupes. And at moments this gets done more than 5 times a second.

500,000 * 50 * 5 = 125,000,000 string comparisons per second. Yes, that will indeed keep a quad core busy. I'm surprised it only eats 50%. Just goes to show how well it scales :). Keep in mind that filtering dupes is not the only thing happening a few times a second here.
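That cost comes from List.exists scanning the whole crawled list for every candidate URL. As a hedged sketch (not part of the original snippet), here's what swapping the list for F#'s immutable Set, whose contains is a balanced-tree lookup, does to that check:

Code:
```fsharp
open System.Diagnostics

(* 500,000 already-crawled URLs, plus 50 candidates where half are dupes *)
let crawled    = [ for i in 1 .. 500_000 -> sprintf "http://example.org/page/%d" i ]
let crawledSet = Set.ofList crawled
let candidates = [ for i in 499_976 .. 500_025 -> sprintf "http://example.org/page/%d" i ]

(* List-based dedup: each lookup may walk the entire 500,000-entry list *)
let sw = Stopwatch.StartNew()
let freshList = candidates |> List.filter (fun u -> not (List.exists ((=) u) crawled))
printfn "List: %d new URLs in %dms" freshList.Length sw.ElapsedMilliseconds

(* Set-based dedup: each lookup is a balanced-tree search *)
sw.Restart()
let freshSet = candidates |> List.filter (fun u -> not (Set.contains u crawledSet))
printfn "Set:  %d new URLs in %dms" freshSet.Length sw.ElapsedMilliseconds
```

Both filters return the same 25 new URLs; the Set version just does roughly log2(500,000), about 19, comparisons per candidate instead of up to 500,000.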
 
I don't think the amount of code matters, I'm looking for a comparison of speeds between F# and C#.

But them stats are pretty nice Hyperz!

Actually, you're using libraries as well here
Code:
open Hyperz.SharpLeech.Engine
open Hyperz.SharpLeech.Engine.Html
open Hyperz.SharpLeech.Engine.Net
 
It does matter because I'm not gonna bother with a C# port. But the answer to your question is that F# will be faster. Not because it's a better language per se, they're both .NET anyway, but because it's a functional language. Some stuff will perform better written in C#, some will run better in F#. This is one of the things that runs better in a functional language because of the asynchronous stuff involved.

F# is gonna be a better choice over C# for:
- stuff that needs to run in parallel
- complex maths
- AI / neural networks
- situations where a lot of data needs to be handled
- writing a compiler
- ray-tracing
- DSLs
- a few other things

For other stuff you really want to use C#. Did you know they wrote the F# 2.0 compiler in F# itself in under 10,000 lines of code :p?

And yes it uses the SL library. As I stated in the 1st post this was made to test the Http part of it.
 
I'm not even gonna entertain async. I'm going to throw it all in an array and ThreadPool it like I showed you on TeamSpeak.
 