[F#][SNIPPET] Basic web crawler example (multi-core)

Status
Not open for further replies.

Hyperz

Code:
open System
open System.Net
open Hyperz.SharpLeech.Engine
open Hyperz.SharpLeech.Engine.Html
open Hyperz.SharpLeech.Engine.Net

(* Download a page; on error just return whatever data came back,
   otherwise follow redirects and return the final page *)
let getData url =
    url
    |> Http.Prepare
    |> Http.Request
    |> fun result ->
        if result.HasError then result.Data
        else Http.HandleRedirects(result, false).Data

(* Extract every link from a page and normalize it to an absolute URL *)
let getUrls html sourceUrl =
    let baseUrl =
        let u = Uri(sourceUrl)
        u.Scheme + "://" + u.Host
    let doc = HtmlDocument()
    doc.LoadHtml(html)
    doc.DocumentNode.SelectNodes("//a")
    |> Seq.map (fun node -> node.GetAttributeValue("href", ""))
    |> Seq.map (fun url -> HttpUtility.HtmlDecode(url).Trim())
    |> Seq.map (fun url ->
        if url.StartsWith("http://") then url
        elif url.StartsWith("https://") then url
        elif url.StartsWith("/") then baseUrl + url
        elif url.StartsWith("#") then "" (* drop in-page anchors *)
        else baseUrl + "/" + url)
    |> Seq.filter (fun url -> url.Length > 0)

(* Recursively crawl, threading the list of visited URLs through the
   recursion instead of keeping mutable shared state *)
let rec crawl url crawled = Async.Start(async {
    let data = getData url
    let urls =
        getUrls data url
        |> Seq.filter (fun u -> not (List.exists ((=) u) crawled))
        |> List.ofSeq (* materialize once; avoids enumerating the lazy seq twice *)
    do printfn "Crawling: %s\nFound:    %i URLs" url (List.length urls)
    for u in urls do crawl u (crawled @ [u]) })


(* ================================================ *)
(* START CRAWLING                                   *)
(* ================================================ *)

let url = "http://thepiratebay.org/"
let rec memCleaner() =
    (* Clean memory every 10 seconds *)
    System.Threading.Thread.Sleep(10000)
    GC.Collect()
    memCleaner()

ServicePointManager.DefaultConnectionLimit <- 10
Http.MaxRedirects <- 2
Http.Timeout <- 10000
Http.KeepAlive <- true
Http.UseCompression <- true
Console.BufferWidth <- 256
Console.BufferHeight <- 768
Console.Title <- "F# Web Crawler"

(* Start the crawler and mem cleaner *)
Async.Start(async{memCleaner()})
crawl url [url]

stdin.Read() |> ignore

This is just a little something I made while working on SL to test the performance of the Http class. It crawls a web page, extracts the links, filters out the dupes (already-crawled ones) and adds the new links to a queue. It's based on recursion and is "stateless" (no mutable variables are used to store data or state), which means that no matter how many CPU cores you have, it'll scale linearly across all of them with very little performance loss. It doesn't have any practical function, but it might be a handy reference for people looking into functional programming or F#.
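For anyone reading along without the SharpLeech library, the core idea (immutable state threaded through the recursion, per-page work fanned out asynchronously) can be sketched on its own. Everything below is a hypothetical stand-in, not the SL implementation: fetchLinks fakes the download/extract step, and an immutable Set replaces the list for the visited check.

Code:
```fsharp
open System

(* Hypothetical stand-in for getData + getUrls: maps a URL to the
   links "found" on that page. A real version would do an HTTP fetch. *)
let fetchLinks (url: string) =
    match url with
    | "http://example.org/"  -> [ "http://example.org/a"; "http://example.org/b" ]
    | "http://example.org/a" -> [ "http://example.org/b"; "http://example.org/c" ]
    | _ -> []

(* Crawl breadth-first, rebuilding an immutable Set of visited URLs
   each round instead of mutating shared state. *)
let rec crawl (frontier: string list) (visited: Set<string>) =
    match frontier with
    | [] -> visited
    | _ ->
        let found =
            frontier
            |> List.map (fun u -> async { return fetchLinks u })
            |> Async.Parallel            (* fetch the whole frontier concurrently *)
            |> Async.RunSynchronously
            |> Seq.concat
            |> Seq.filter (fun u -> not (Set.contains u visited))
            |> Seq.distinct
            |> List.ofSeq
        crawl found (visited + Set.ofList found)

let visited = crawl [ "http://example.org/" ] (Set.singleton "http://example.org/")
printfn "Visited %d pages" visited.Count
```

Because visited is an immutable Set rebuilt each round rather than a shared mutable collection, the parallel fetches never contend over state, which is the property described above.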

Video showing a slightly modified version:
[youtube]c41mmbuXBZQ[/youtube]
Ignore the heavy hiphop tune and other nonsense. I was a bit drunk when I was recording it, lol :(.
 
8 comments
lool! we were talking on teamspeak about this. Personally, and i dont care what you say Hyp, my method pwns urs! Let the battle begin! >_> i'll post some speed-tests using my little engine.

EDIT: chomps ur CPU a little - why so? What part needs so much calculation?
 
:/ makes me wonder how fast the googlebot system is..

Great tut Hyperz, any chance of a C# version for a comparison of speeds?

A C# version would have at least 10 times more code to do the same thing.

Quote:
lool! we were talking on teamspeak about this. Personally, and i dont care what you say Hyp, my method pwns urs! Let the battle begin! >_> i'll post some speed-tests using my little engine.

EDIT: chomps ur CPU a little - why so? What part needs so much calculation?

This doesn't show my method >_>. But sure, bring it on. It's time to kick ass and chew bubblegum!

It has a queue of 500,000 URLs. Each time it crawls a page it extracts 50 URLs on average. Each of those 50 URLs has to be checked against the entire queue to filter out dupes. And at moments this gets done more than 5 times a second.

500,000 * 50 * 5 = 125,000,000 string comparisons per second. Yes, that will indeed keep a quad core busy. I'm surprised it only eats 50%. Just goes to show how well it scales :). Keep in mind that filtering dupes is not the only thing happening a few times a second here.
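That cost comes from List.exists scanning the whole crawled list for every candidate URL. As a hedged sketch (not part of the original snippet), here's what swapping the list for F#'s immutable Set, whose contains is a balanced-tree lookup, does to that check:

Code:
```fsharp
open System.Diagnostics

(* 500,000 already-crawled URLs, plus 50 candidates where half are dupes *)
let crawled    = [ for i in 1 .. 500_000 -> sprintf "http://example.org/page/%d" i ]
let crawledSet = Set.ofList crawled
let candidates = [ for i in 499_976 .. 500_025 -> sprintf "http://example.org/page/%d" i ]

(* List-based dedup: each lookup may walk the entire 500,000-entry list *)
let sw = Stopwatch.StartNew()
let freshList = candidates |> List.filter (fun u -> not (List.exists ((=) u) crawled))
printfn "List: %d new URLs in %dms" freshList.Length sw.ElapsedMilliseconds

(* Set-based dedup: each lookup is a balanced-tree search *)
sw.Restart()
let freshSet = candidates |> List.filter (fun u -> not (Set.contains u crawledSet))
printfn "Set:  %d new URLs in %dms" freshSet.Length sw.ElapsedMilliseconds
```

Both filters return the same 25 new URLs; the Set version just does roughly log2(500,000), about 19, comparisons per candidate instead of up to 500,000.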
 
I don't think the amount of code matters, I'm looking for a comparison of speeds between F# and C#.

But them stats are pretty nice Hyperz!

Actually, you're using libraries as well here
Code:
open Hyperz.SharpLeech.Engine
open Hyperz.SharpLeech.Engine.Html
open Hyperz.SharpLeech.Engine.Net
 
It does matter because I'm not gonna bother with a C# port. But the answer to your question is that F# will be faster. Not because it's a better language per se, they're both .NET anyway, but because it's a functional language. Some stuff will perform better written in C#, some will run better in F#. This is one of the things that runs better in a functional language because of the asynchronous stuff involved.

F# is gonna be a better choice over C# for:
- stuff that needs to run in parallel
- complex maths
- AI / neural networks
- situations where a lot of data needs to be handled
- writing a compiler
- ray-tracing
- DSLs
- a few other things

For other stuff you really want to use C#. Did you know they wrote the F# 2.0 compiler in F# itself in under 10,000 lines of code :p?

And yes it uses the SL library. As I stated in the 1st post this was made to test the Http part of it.
 
I'm not even gonna entertain async. I'm going to throw it all in an array and ThreadPool it like I showed you on TeamSpeak.
 