SharpLeech 2 source code is now available on GitHub under GPL v2

Status
Not open for further replies.

Hyperz

Was just passing by and saw deeTrix's ARTLeech topic, which reminded me that I still had an old project lying around collecting dust. Since I abandoned it ages ago and usually mess with C++/Qt these days, I figured I might as well release the source code so others can benefit from, use, and/or learn from it. So, enjoy :).

Repo:

https://github.com/Hyperz/SharpLeech

Releases:

https://github.com/Hyperz/SharpLeech/releases

Notes:


  • The code is OLD and a lot of it wasn't written "by the book", so don't expect proper design patterns.
  • Support for newer forum types can be added through DefaultSiteTypes.cs in the Engine project.
  • Even though it's old, AFAIK it is still the fastest and most feature-rich forum leecher. All it really needs is support for newer forum software.
  • Don't count on me for support, this project is abandoned.

 
Small heads up: I'm currently working with a client who needs SharpLeech to work on some specific forums. Because of this, and because the GPL v2 license requires me to make any changes publicly available, there will be a 2.0.1 release which will, at the very least, add IPB 3.4.x support and improved vBulletin 4.x.x support. Note that this doesn't mean the default plugins will be updated. I *might* build plugins for specific sites (for a fee, that is) for those that need it after I'm done with the current work.

The release will be posted at https://github.com/Hyperz/SharpLeech/releases when it's done.
 
So this thing is ancient. I'm surprised you even got it to post stuff at this point. Back in the day it could scrape from most supported forum software without requiring custom scraping code (with some exceptions). However, that code is 7 years out of date by now and, looking back, the codebase is complete garbage (which is why I took down the code a long time ago, besides GitHub not liking piracy stuff).

To get it to scrape the correct topic title you'd have to provide a custom scraping implementation in the plugin XML files. If you're scraping from a phpBB 2 forum, the easiest way to do that would be to copy/paste the code from the default phpBB 2 implementation and edit it, and/or look at existing plugins.
 
Sure, it's old, but it gets the job done. Compiling it was easier than I thought it would be for someone who has barely used Visual Studio.

I opened up an XML file and used "Save As" to avoid overwriting it. I think the one I copied was warezbb. I'm not sure which one in the SiteReaders subfolder is the default phpBB 2 implementation. I'm trying to copy over the Cheat Engine tutorial threads, since they recently deleted their Tables forum and are looking for third parties to host the content.

Here's the current cheatengine.xml that I have.

Code:
<?xml version="1.0" encoding="utf-8" ?>

<!-- SharpLeech 2.x.x SiteReader Plugin -->

<!-- Version MUST be in x.x.x.x format! -->
<SiteReader pluginVersion="2.0.0.0" pluginAuthor="Hyperz">
    <Settings>
        <SiteName>Cheatengine</SiteName>
        <BaseUrl>http://forum.cheatengine.org</BaseUrl>
        <TopicsPerPage>45</TopicsPerPage>
      
        <!-- Supported type values are: IP.Board 3.1.4+, IP.Board 3.x.x, IP.Board 2.x.x,
             vBulletin 4.x.x, vBulletin 3.x.x, phpBB 3.x.x, phpBB 2.x.x -->
        <Type>phpBB 2.x.x</Type>
      
        <!-- If unsure choose ISO-8859-1. Except for phpBB 3 boards, they use UTF-8 by default. -->
        <DefaultEncoding>ISO-8859-1</DefaultEncoding>
      
        <!-- Set to true if the site uses SEO urls, otherwise false. -->
        <AllowRedirects>false</AllowRedirects>
        <UseFriendlyLinks>false</UseFriendlyLinks>
    </Settings>

    <Sections>
        <Section title="Cheat Engine Tutorials" id="7" />
      
      
        <!-- If you have an account with VIP access you can un-comment this (:
        <Section title="VIP / Donators Only" id="24" />
        -->
    </Sections>

    <!-- Edit this when the site requires custom parsing -->
    <Code>
        <![CDATA[
      
        protected override void Init()
        {
            base.Init();
        }

        public override void LoginUser(string username, string password)
        {
            base.LoginUser(username, password);
        }

        public override void LogoutUser()
        {
            base.LogoutUser();
        }

        public override string[] GetTopicUrls(string html)
        {
            return base.GetTopicUrls(html);
        }

        public override SiteTopic GetTopic(string url)
        {
            return base.GetTopic(url);
        }

        public override SiteTopic GetTopic(int topicId)
        {
            return base.GetTopic(topicId);
        }
      
        public override HttpWebRequest GetPage(int sectionId, int page, int siteTopicsPerPage)
        {
            return base.GetPage(sectionId, page, siteTopicsPerPage);
        }

        public override void MakeReady(int sectionId)
        {
            base.MakeReady(sectionId);
        }
      
        ]]>
    </Code>
</SiteReader>
 
You'll need to edit this part:
C#:
        public override SiteTopic GetTopic(string url)
        {
            return base.GetTopic(url);
        }

This is the default code for it, taken from the DefaultSiteTypes.cs file:
C#:
        public override SiteTopic GetTopic(string url)
        {
            if (!this.User.IsLoggedIn) return null;

            HtmlDocument doc = new HtmlDocument();
            HttpWebRequest req;
            HttpResult result;

            req = Http.Prepare(url);
            req.Method = "GET";
            req.Referer = url;

            try
            {
                result = this.AllowRedirects ? Http.HandleRedirects(Http.Request(req), false) : Http.Request(req);
                doc.LoadHtml(result.Data);

                ErrorLog.LogException(result.Error);

                HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@alt='Reply with quote']");
                string link = HttpUtility.HtmlDecode(nodes[0].ParentNode.GetAttributeValue("href", String.Empty));
                
                nodes = doc.DocumentNode.SelectNodes("//*[@class='maintitle']");
                string title = HttpUtility.HtmlDecode(nodes[0].InnerText).Trim();

                req = Http.Prepare((link.StartsWith("http:")) ? link : this.BaseUrl + "/" + link);
                req.Method = "GET";
                req.Referer = url;

                result = this.AllowRedirects ? Http.HandleRedirects(Http.Request(req), false) : Http.Request(req);
                doc.LoadHtml(result.Data);

                ErrorLog.LogException(result.Error);

                string content = doc.DocumentNode.SelectNodes("//textarea[@name='message']")[0].InnerText;

                content = HttpUtility.HtmlDecode(content.Substring(content.IndexOf(']') + 1)).Trim();
                content = content.Substring(0, content.Length - "[/quote]".Length);

                // Empty read topics cookie
                var cookies = from Cookie c in Http.SessionCookies
                              where c.Name.EndsWith("_t")
                              select c;

                foreach (Cookie c in cookies) c.Value = String.Empty;

                return new SiteTopic(
                    title.Trim(),
                    content.Trim(),
                    0, 0, url
                );
            }
            catch (Exception error)
            {
                ErrorLog.LogException(error);
                return null;
            }
        }

Assuming everything else works you'd just need to edit the xpath of:
C#:
nodes = doc.DocumentNode.SelectNodes("//*[@class='maintitle']");
 
Had a look at the cheatengine site. It seems they use the same CSS class for the site's title. This should get the topic title instead:
C#:
 nodes = doc.DocumentNode.SelectNodes("//a[@class='maintitle']");
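For anyone unsure why that one-character change matters: `//*[@class='maintitle']` matches *any* element with that class, so if the board title appears earlier in the document than the topic link, it wins. `//a[@class='maintitle']` restricts the match to anchor tags. Here's a minimal sketch of the difference; the markup is a made-up fragment mimicking phpBB 2's layout (not taken from the actual site), and it uses the BCL's `XmlDocument` for illustration since the fragment is well-formed, whereas SharpLeech itself parses real HTML with HtmlAgilityPack:

```csharp
using System;
using System.Xml;

class XPathDemo
{
    static void Main()
    {
        // Hypothetical fragment: board title and topic title share the
        // "maintitle" class but live on different element types.
        var doc = new XmlDocument();
        doc.LoadXml(
            "<body>" +
            "<span class='maintitle'>Cheat Engine :: Index</span>" +
            "<a class='maintitle' href='viewtopic.php?t=1'>Some Tutorial</a>" +
            "</body>");

        // //*[@class='maintitle'] matches ANY element with that class,
        // so the board title (the <span>) comes back first.
        Console.WriteLine(doc.SelectNodes("//*[@class='maintitle']")[0].InnerText);

        // //a[@class='maintitle'] restricts the match to anchors,
        // which is what the topic title uses.
        Console.WriteLine(doc.SelectNodes("//a[@class='maintitle']")[0].InnerText);
    }
}
```

The same XPath expressions behave identically under HtmlAgilityPack's `DocumentNode.SelectNodes`, which is why the fix above is just the node test.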
 
I was browsing through some posts about how you wanted to make SL open source so other users could contribute. Does SharpLeech have a Discord yet?
 
Nope. The last time I worked on it was 7 years or so ago. I released the source mostly because I wasn't doing anything with it anymore. In other words, this is pretty much abandoned software. If you want to do something with it or fork it, feel free to do so. All I ask is that the credits for the original work remain.

That said, I wouldn't recommend basing anything off this code. Some of it dates back to 2008, when I initially got into programming. It doesn't follow any design patterns, and a lot of the code doesn't make sense when you look at C# 7 and the features of .NET 4.5 and later. For example, the HTTP stuff doesn't use modern async and relies on the deprecated HttpWebRequest/Response classes (use HttpClient and maybe something like Flurl.Http), the plugin system should be done using something like MEF (the Managed Extensibility Framework), the GUI should use a proper design pattern like MVVM, the IRC and media player bloat shouldn't be in there, etc.
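To illustrate the HttpClient point: a minimal sketch of what a page fetch could look like with async/await instead of the HttpWebRequest plumbing in GetTopic above. The URL is a placeholder for illustration, not a real SharpLeech endpoint:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ModernHttpSketch
{
    // One shared HttpClient for the app's lifetime; creating a new client
    // per request can exhaust sockets, unlike the old per-call
    // HttpWebRequest style.
    private static readonly HttpClient Client = new HttpClient();

    static async Task<string> FetchPageAsync(string url)
    {
        // GetStringAsync throws HttpRequestException on non-success
        // status codes, replacing the manual HttpResult/ErrorLog checks.
        return await Client.GetStringAsync(url);
    }

    static async Task Main()
    {
        // Placeholder URL, for illustration only.
        string html = await FetchPageAsync("http://example.com/");
        Console.WriteLine(html.Length);
    }
}
```

Redirect handling, cookies, and encoding (the AllowRedirects/DefaultEncoding settings from the plugin XML) would hang off an `HttpClientHandler` passed to the HttpClient constructor rather than being hand-rolled.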
 
Unfortunately, I'm not a coder, so I wouldn't be able to add anything even if I wanted to, at least not easily lol. I barely figured out how to compile it. I'm used to dealing with modifications to vBulletin and some .htaccess when I need to (linked to my vb.org profile), which is quite different from dealing with C#. Realistically, I'd have to ask another coder, or put the file and source out there and hope some other coder makes the changes I need while still keeping SL free.

If you're up for it, I would pay to have the DefaultSiteTypes.cs file updated to provide proper leeching from:

  • MyBB 1.8 & 2.0
  • vBulletin 4 (still has issues from what I can tell when trying to leech, at least from this site: http://www.psvitaiso.com/)
  • vBulletin 5 (it sucks, but surprisingly some communities have adopted it, against advice not to)
  • SMF (Simple Machines Forum), versions 1 & 2
  • ProBoards (big free forum software)
  • XenForo

...and example XML templates for each.

I understand if you don't want to, but I figured I would ask and explain my situation/ideas anyway.
 
No harm in asking, but I'll have to pass on that. You can probably find a freelancer willing to do it, but chances are it's never going to be worth the price unless they charge crazy low rates and/or only make it work with the default HTML/CSS of the forum software.

For every individual forum software:

  • A nulled copy has to be found if it's not free software, and a local installation has to be set up.
  • A scraping and posting implementation has to be written.
  • A bunch of existing forums running that software have to be found so that a pattern can be identified and the scraping implementation adjusted to work with most of them.
  • The code has to be tested against those existing forums AND a few forums of every other supported forum type.

None of that is hard or requires a lot of code but it is a very time consuming and annoying process if you want to do it right. And at the end of the day there will still be plenty of sites that will require a custom scraping implementation because their html/css structure deviates from the default one. Not to mention cross-forum incompatibilities (such as BBCode).
 