1. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,258
    The Problem.

    Pirate streaming sites have a problem. Over the years they have been getting swarmed by an army of automated bots/scrapers that detect links to copyrighted work and automatically file DMCA complaints, and it's getting worse every day. In some cases content gets reported and deleted within minutes. As I've explained a few times, you can't really prevent this from happening. Some sites offer services to protect/hide links or embeds, but in reality using such a service will do nothing but annoy your users. Other sites try cheap tricks like encoding the links to try and hide them, but that is utterly pointless. As someone who's been writing bots/scrapers for over a decade I can tell you that none of that stuff works. It's 30 minutes of extra work at best; usually it's as simple as adding one line of code to decode the encoded link. There are other steps you can take to combat bots in general, but again, none of them are effective or practical against anything but the simplest bots. The kind of bots employed by the copyright mafia couldn't care less about any of the above. Well then... if you can't prevent them from doing their thing, what can you do?
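    To illustrate how pointless that kind of link encoding is, here's a hypothetical example (the URL is made up, and base64 stands in for whatever home-grown scheme a site might use): the "hidden" link is undone with a single extra line.

```python
import base64

# Hypothetical example: a site "hides" its embed URL by base64-encoding it.
secret = 'https://example.com/embed/abc123'           # the link the site wants to hide
encoded = base64.b64encode(secret.encode()).decode()  # what actually ships in the page

# The one extra line of code a bot needs to undo it:
link = base64.b64decode(encoded).decode()
assert link == secret
```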

    But First...

    There is something I need to point out. Scraping is easy (if you know what you're doing) and requires very little in terms of resources. A single cheapo dedicated server is enough to track and scrape hundreds of websites; it can scrape and process dozens of pages per second easily. This means that for the price of a cheap server they can take on all of the streaming sites that matter, or perhaps even all of them. Period. This is one of the reasons anti-piracy is a viable business model: they get paid good money by the copyright holders and the operating costs are low. All they need is one or a couple of servers and some devs. This is important to understand because it's one of the key reasons why DMCA abuse and trolling has become so widespread. It's easy money. But what if that wasn't the case? The answer to that question should be pretty obvious.

    Distributed Denial of Copyright.

    Everyone here should know what a DDoS attack is: a network of computers spams a server or network until it overloads, knocking out the site/service. So what's "DDoC"? Excellent clickbait material if I do say so myself :smirk:! Nah, just kidding :sweat_smile:. For starters, unlike DDoS it is not an attack but a defense. Consider this: what if a server that could scrape 100 pages per second suddenly can only do 1 per second? There are probably hundreds, if not thousands, of pirate streaming sites, and each one easily has a few thousand pages worth of movies/episodes/whatever. Suddenly, instead of that latest episode getting reported within an hour, it takes days. Instead of needing one server to DMCA-troll the entire internet, you now need an entire datacenter. Operating costs explode.

    But How?

    How do you slow a server down 100x? Easy: you increase the workload 100x. Right now processing a page only takes a fraction of a second: an HTTP request, parse the HTML DOM, extract the links, and that's it. This is a very lightweight task, which is why you can do it many times a second in parallel on a single server. However, if we were to pull the links through 5000 iterations of state-of-the-art encryption, a CPU core would need to spend a few seconds decrypting them. To avoid nuking your own server you have to use JavaScript to do the same thing on the client side (inside the browser). A user on your site wouldn't have much of a problem waiting a couple of extra seconds for the stream to start or the links to appear, but for a server running bots that have to do this a ton of times it creates a massive CPU bottleneck. The drawback is that this relies on being a widely used technique to really ruin the copyright mafia's day. That said, if nothing else, the implementation I have in mind for this concept is the best link obfuscation out there bar none. So even if nobody else uses it, it still does a better job at hiding links than anything else out there. I'm not sure yet if I'm actually going to write a proof-of-concept implementation; being a keeper of robots myself, I'd be shooting myself in the foot for little gain. That said, the idea is pretty simple, so anyone with some programming experience can easily implement it.
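    A minimal sketch of the principle, using only the standard library. This is not the AES scheme described above; it swaps in iterated SHA-256 with an XOR mask as a cheap stand-in for the encryption passes, but the economics are the same: whoever decodes the link (user or bot) must burn through every pass, and the cost scales linearly with the pass count.

```python
import hashlib

def slow_key(seed: bytes, passes: int) -> bytes:
    # Chain `passes` rounds of SHA-256; anyone decoding must redo all of them.
    key = seed
    for _ in range(passes):
        key = hashlib.sha256(key).digest()
    return key

def xor_mask(data: bytes, key: bytes) -> bytes:
    # Reversible XOR mask (stand-in for real encryption).
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

seed = b'public-seed'  # shipped with the page, like the key in the PoC header below
passes = 50000         # tune so decoding takes a second or two per link
blob = xor_mask(b'https://example.com/stream', slow_key(seed, passes))

# Recovering the link forces the same `passes` rounds of hashing:
assert xor_mask(blob, slow_key(seed, passes)) == b'https://example.com/stream'
```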

    Thoughts? I'd love to hear what other devs think. Actually, this is one of those rare cases where I'd love to see someone poke holes into this and tell me "nah that will never work because XYZ, stop being silly!". :thinking:
     
    Last edited: May 28, 2018
  2. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,258
    Python 3.6 PoC:

    Code:
    #!/usr/bin/env python3
    
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
    from cryptography.hazmat.backends import default_backend
    from PIL import Image
    from zlib import crc32
    from random import getrandbits
    from math import ceil
    
    
    CRYPTO_KEY = 'eaff2e97ad4449104ef2dd479a212c39c270a80d87aa07ab21606a7ff8604032'
    CRYPTO_BLOCK_SIZE = algorithms.AES.block_size // 8
    CRYPTO_PASSES = 25000  # decryption takes about 5-7 seconds on an i5 2500k @ 4.5GHz
    
    IMG_WIDTH = 1024
    IMG_BPP = 32
    
    
    def main() -> None:
    
        text = '''
            <h3><a id="scrape_this" href="https://www.wjunction.com/threads/ddoc-distributed-denial-of-copyright-how-to-fight-an-armada-of-dmca-bots.228164/">Distributed Denial of Copyright</a></h3>
            <p><iframe id="scrape_this_too" src="https://openload.co/embed/mQhErnEyNCs/" scrolling="no" frameborder="0" width="700" height="430" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe></p>
        '''
    
        img = ddoc_it(text)
        img.save('test.png', 'PNG', quality=100, optimize=False)
        img.close()
    
        img = Image.open('test.png')
        print(unddoc_it(img))
        img.close()
    
    
    def ddoc_it(text: str) -> Image.Image:
    
        key = bytes.fromhex(CRYPTO_KEY)
        iv = bytes(getrandbits(8) for _ in range(16))
        payload = encrypt(text.encode('utf-8'), key, iv, CRYPTO_PASSES)
        header_and_payload = [
            b'DDoC',                               # Header: char[4]    SIGNATURE
            0x0001.to_bytes(2, 'little'),          # Header: uint16     VERSION
            key,                                   # Header: uint8[32]  KEY
            iv,                                    # Header: uint8[16]  IV
            CRYPTO_PASSES.to_bytes(4, 'little'),   # Header: uint32     PASSES
            len(payload).to_bytes(4, 'little'),    # Header: uint32     PAYLOAD_SIZE
            crc32(payload).to_bytes(4, 'little'),  # Header: uint32     PAYLOAD_CHECKSUM
            payload,
        ]
        data = b''.join(header_and_payload)
        pixels_required = ceil(len(data) / (IMG_BPP // 8))
        height = max(1, ceil(pixels_required / IMG_WIDTH))
        data = data.ljust(height * IMG_WIDTH * (IMG_BPP // 8), b'\x00')
        img = Image.frombytes('RGBA', (IMG_WIDTH, height), data)
    
        return img
    
    
    def unddoc_it(img: Image.Image) -> str:
    
        data = img.tobytes()
    
        header_signature = data[0:4]
        header_version = int.from_bytes(data[4:6], 'little')
        header_key = data[6:38]
        header_iv = data[38:54]
        header_passes = int.from_bytes(data[54:58], 'little')
        header_payload_size = int.from_bytes(data[58:62], 'little')
        header_payload_checksum = int.from_bytes(data[62:66], 'little')
    
        payload = data[66:(66 + header_payload_size)]
        checksum = crc32(payload)
    
        assert header_signature == b'DDoC'
        assert header_version <= 0x0001
        assert header_payload_checksum == checksum
    
        text = decrypt(payload, header_key, header_iv, header_passes).decode('utf-8')
    
        return text
    
    
    def encrypt(data: bytes, key: bytes, iv: bytes, passes: int) -> bytes:
    
        backend = default_backend()
        cipher = Cipher(algorithms.AES(key), modes.CBC(iv), backend)
    
        for _ in range(max(1, passes)):
    
            encryptor = cipher.encryptor()
            data = encryptor.update(pad(data)) + encryptor.finalize()
    
        return data
    
    
    def decrypt(data: bytes, key: bytes, iv: bytes, passes: int) -> bytes:
    
        backend = default_backend()
        cipher = Cipher(algorithms.AES(key), modes.CBC(iv), backend)
    
        for i in range(max(1, passes)):
    
            decryptor = cipher.decryptor()
            data = decryptor.update(unpad(data) if i > 0 else data) + decryptor.finalize()
    
        return unpad(data)
    
    
    def pad(b: bytes) -> bytes:
    
        length = len(b)
        padding = CRYPTO_BLOCK_SIZE - (length % CRYPTO_BLOCK_SIZE)
    
        return b.ljust(length + padding, bytes([padding]))
    
    
    def unpad(b: bytes) -> bytes:
    
        return b[:-ord(b[len(b) - 1:])]
    
    
    if __name__ == '__main__':
    
        main()
    
    To use the resulting image on a web page, view the source of this demo page: https://hyperz.github.io/index.html.
     
    Last edited: Jun 1, 2018
  3. Tango

    Tango Super Moderator Staff Member

    Jul 9, 2009
    3,200
    gethostbyaddr is very useful: you could check whether the reverse DNS of a visitor/bot contains e.g. 'leaseweb' and pause for x amount of time or redirect.

    For example, if you have a report with 20 URLs you could find the bot's IP in your server logs and ban/limit the whole IP range/datacenter. But... some bots remove the reverse DNS entry so it shows up blank.

    Then some bots just pull URLs from Google's index.

    FYI, there are some bots out there with 2000+ IPs. The main page is usually some unblock-piratebay sort of script, but they can scrape a whole site in minutes.
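    A rough sketch of that reverse-DNS check in Python (the hostname fragments are just example guesses; you'd maintain your own list of hosting providers):

```python
import socket

# Example hostname fragments of datacenter/hosting providers; extend as needed.
DATACENTER_HINTS = ('leaseweb', 'ovh', 'hetzner', 'amazonaws')

def reverse_name(ip: str) -> str:
    # Reverse-DNS (PTR) lookup; empty string when the record is missing/removed.
    try:
        return socket.gethostbyaddr(ip)[0].lower()
    except OSError:
        return ''

def looks_like_datacenter(hostname: str) -> bool:
    return any(hint in hostname for hint in DATACENTER_HINTS)

# e.g. throttle or redirect when looks_like_datacenter(reverse_name(client_ip))
```

    As noted above, a bot can simply drop its PTR record, so a blank result deserves suspicion too.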
     
    Last edited: May 28, 2018
  4. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,258
    Sadly that is one of those general anti-bot methods I was talking about. At minimum, proxies and/or a botnet defeat it. If you do the pausing on the server side you create an attack vector (resource starvation/DoS); if you do it on the client side the bot can just ignore the JS or limit its execution time.

    DDoC fixes this because the URLs are not exposed in plain text. Actually, they would not be exposed as text at all (not even encrypted text). In fact, my implementation would likely cause a situation where Google unknowingly hosts a copy of said URLs lol. Fun times. Sometimes I miss running a site.

    Edit: my focus with this is making scraping unfeasible rather than trying to counter the bots; you can't counter bots, but you can make their job uneconomical.
     
    Last edited: May 29, 2018
  5. Ranchvapour

    Ranchvapour Well-Known Member

    Dec 23, 2013
    446
    I haven't done any scraping but does Cloudflare's Under Attack mode do anything against bots?
     
  6. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,258
    Not really. It can help with some types of flood attacks, assuming the attack doesn't exploit something that spikes the CPU/memory. Against scraping bots it does nothing that I'm aware of besides a simple JS check. One of my bots bypasses it in roughly 60 lines of Python without even needing JS bindings: just some string operations and Python's eval(), and even eval() isn't strictly required to solve it.
     
    Last edited: May 29, 2018
  7. Ranchvapour

    Ranchvapour Well-Known Member

    Dec 23, 2013
    446
    Well, my site is slow as shit so I guess that works in my favour. :sweat_smile:
     
  8. octopusdev

    octopusdev Active Member

    May 29, 2018
    25
    It doesn't help much because it only checks for JavaScript in the user's browser. You can easily bypass that by parsing the JavaScript and redoing its math using its own logic.

    One thing that might help, but not solve it, is rate limiting. Those bots are usually concurrent, meaning they make several requests to your server in parallel because it's faster. With rate limiting you can set up your server to return an error when a client tries to make more than N requests per second; a multi-threaded/concurrent bot then won't work as expected because it has to queue the scraping, which is frustrating.

    As I said earlier, it won't solve the problem, but it helps.
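    For what it's worth, a minimal sliding-window limiter along those lines might look like this (N=10 per second is an arbitrary example value):

```python
import time
from collections import defaultdict, deque
from typing import Optional

WINDOW = 1.0  # seconds
LIMIT = 10    # max requests per client per window (example value)

hits = defaultdict(deque)  # client IP -> timestamps of recent requests

def allow(ip: str, now: Optional[float] = None) -> bool:
    # Sliding-window limiter: reject once a client exceeds LIMIT per WINDOW.
    now = time.monotonic() if now is None else now
    q = hits[ip]
    while q and now - q[0] > WINDOW:
        q.popleft()  # forget requests that fell out of the window
    if len(q) >= LIMIT:
        return False  # over the limit: send an error and make the bot queue up
    q.append(now)
    return True
```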
     
  9. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,258
    Alright, just wrote a mockup of it in Python. It's fairly simple: take input text, encrypt it with 25,000 passes of 256-bit AES, and store the output along with a DDoC header in a PNG image. The idea is to read this image in JS using a canvas and decrypt the data. Once the data is decrypted you just show it. Because it's just text at the end of the day, you can also use this to encrypt JS or HTML instead of plain text, so in terms of obfuscation you can do some really funky stuff. If you were to use only 1 pass of encryption, your entire site could consist of images containing AES-encrypted content that gets dynamically loaded in the user's browser instead of everything being plain HTML/JS/CSS.

    This image is the result of "ddocing" the URL of this thread:
    [IMG]

    There's a downside here. AES works in 128-bit blocks and requires input to be aligned to them, which means that if the input isn't a multiple of 16 bytes it has to be padded. This inflates the size: after 25,000 passes nearly all of the data is nothing but padding. This is very suboptimal and bad for bandwidth, so AES might not be a good choice here. Still, it's workable.
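    A quick back-of-envelope check of that inflation, based on the pad() in the mockup: the first pass rounds the input up to a block boundary, and since every later pass starts from block-aligned data, PKCS#7-style padding adds a full 16-byte block each time.

```python
BLOCK = 16  # AES block size in bytes

def ciphertext_size(plaintext_len: int, passes: int) -> int:
    # Pass 1: pad up to the next block boundary (a full extra block if already aligned).
    size = (plaintext_len // BLOCK + 1) * BLOCK
    # Every later pass starts block-aligned, so padding adds a full block each time.
    return size + (passes - 1) * BLOCK

# A ~116-byte URL balloons to ~400 KB after 25,000 passes:
print(ciphertext_size(116, 25000))  # 400112 bytes
```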

    I haven't written the JavaScript side yet, but here's the Python 3.6 mockup used to create this image. You can use it to reverse the above image using unddoc_it():

    Code:
    #!/usr/bin/env python3
    
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
    from cryptography.hazmat.backends import default_backend
    from PIL import Image
    from zlib import crc32
    from math import ceil
    
    
    CRYPTO_KEY = 'eaff2e97ad4449104ef2dd479a212c39c270a80d87aa07ab21606a7ff8604032'
    CRYPTO_IV = '2fccd3de92e01edb25d6ee41345b714b'
    CRYPTO_BLOCK_SIZE = algorithms.AES.block_size // 8
    CRYPTO_PASSES = 25000  # decryption takes about 5-7 seconds on an i5 2500k @ 4.5GHz
    
    IMG_WIDTH = 1024
    IMG_BPP = 32
    
    
    def main() -> None:
    
        text = 'https://www.wjunction.com/threads/ddoc-distributed-denial-of-copyright-how-to-fight-an-armada-of-dmca-bots.228164/'
    
        img = ddoc_it(text)
        img.save('test.png', 'PNG')
        img.close()
    
        img = Image.open('test.png')
        print(unddoc_it(img))
        img.close()
    
    
    def ddoc_it(text: str) -> Image.Image:
    
        key = bytes.fromhex(CRYPTO_KEY)
        iv = bytes.fromhex(CRYPTO_IV)
        payload = encrypt(text.encode('utf-8'), key, iv, CRYPTO_PASSES)
        header_and_payload = [
            b'DDoC',                               # Header: char[4]    SIGNATURE
            0x0001.to_bytes(2, 'little'),          # Header: uint16     VERSION
            key,                                   # Header: uint8[32]  KEY
            iv,                                    # Header: uint8[16]  IV
            CRYPTO_PASSES.to_bytes(4, 'little'),   # Header: uint32     PASSES
            len(payload).to_bytes(8, 'little'),    # Header: uint64     PAYLOAD_SIZE
            crc32(payload).to_bytes(4, 'little'),  # Header: uint32     PAYLOAD_CHECKSUM
            payload,
        ]
        data = b''.join(header_and_payload)
        pixels_required = ceil(len(data) / (IMG_BPP // 8))
        height = max(1, ceil(pixels_required / IMG_WIDTH))
        data = data.ljust(height * IMG_WIDTH * (IMG_BPP // 8), b'\x00')
        img = Image.frombytes('RGBA', (IMG_WIDTH, height), data)
    
        return img
    
    
    def unddoc_it(img: Image.Image) -> str:
    
        data = img.tobytes()
    
        header_signature = data[0:4]
        header_version = int.from_bytes(data[4:6], 'little')
        header_key = data[6:38]
        header_iv = data[38:54]
        header_passes = int.from_bytes(data[54:58], 'little')
        header_payload_size = int.from_bytes(data[58:66], 'little')
        header_payload_checksum = int.from_bytes(data[66:70], 'little')
    
        payload = data[70:(70 + header_payload_size)]
        checksum = crc32(payload)
    
        assert header_signature == b'DDoC'
        assert header_version <= 0x0001
        assert header_payload_checksum == checksum
    
        text = decrypt(payload, header_key, header_iv, header_passes).decode('utf-8')
    
        return text
    
    
    def encrypt(data: bytes, key: bytes, iv: bytes, passes: int) -> bytes:
    
        backend = default_backend()
        cipher = Cipher(algorithms.AES(key), modes.CBC(iv), backend)
    
        for _ in range(max(1, passes)):
    
            encryptor = cipher.encryptor()
            data = encryptor.update(pad(data)) + encryptor.finalize()
    
        return data
    
    
    def decrypt(data: bytes, key: bytes, iv: bytes, passes: int) -> bytes:
    
        backend = default_backend()
        cipher = Cipher(algorithms.AES(key), modes.CBC(iv), backend)
    
        for i in range(max(1, passes)):
    
            decryptor = cipher.decryptor()
            data = decryptor.update(data if i == 0 else unpad(data)) + decryptor.finalize()
    
        return unpad(data)
    
    
    def pad(b: bytes) -> bytes:
    
        length = len(b)
        padding = CRYPTO_BLOCK_SIZE - (length % CRYPTO_BLOCK_SIZE)
    
        return b.ljust(length + padding, bytes([padding]))
    
    
    def unpad(b: bytes) -> bytes:
    
        return b[:-ord(b[len(b) - 1:])]
    
    
    if __name__ == '__main__':
    
        main()
    
     
    Last edited: May 30, 2018
  10. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,258
    Had to make a few changes because JS is still terrible at doing anything beyond standard web stuff. I've updated the second post with the python code and web demo at https://hyperz.github.io/index.html.
     
  11. jayfella

    jayfella Well-Known Member

    Mar 25, 2009
    1,593
    Now throw a captcha in front of it before you get the data to decode. It's literally designed for this. It's the only thing I can think of that challenges the human/bot concept effectively, and it's a bit of a slap in the face using it against the grain for the gain yo :p
     
  12. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,258
  13. SEOwarez

    SEOwarez Member

    Nov 27, 2017
    12
    A different approach would be to add an iptables rule (on Linux) to block connections that send more than 10 requests in, say, 5 seconds.
    However, that would probably also block legitimate bots like Googlebot.
     
  14. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,258
    That's called rate limiting, and it is easily defeated by spreading the workload over multiple IPs/proxies.
     
