data scraping with powershell quick and dirty example -> pornkind.net (win&unix)


darkestblue

Active Member
I was bored and wrote a quick and dirty script for data scraping. As an example source I use the porn streaming site
pornkind.net
The script parses all pages and posts and creates a CSV file with the following columns:
title, link_pornkind, link_stream, duration, studio, tags, description, thump
The script doesn't use higher-level functions like HTML parsing or element searches with XPath, only string splits for fast and "stable" results (small demo below).
Why PowerShell? Because I'm bored and I can use any tool, not just the standard ones.
The full pornkind_export.csv is downloadable at -> https://anonymousfiles.io/XCVghyEh/
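
To show what "only string splits" means, here is a tiny sketch on a made-up snippet (the URL and title are invented, not taken from the site):

Code:
# minimal demo of the string-split idea on a fake HTML snippet
$sample = '<a href="https://example.com/video/123/" title="Some Example Title">'
((($sample -split 'href="')[1]) -split '"')[0]    # -> https://example.com/video/123/
((($sample -split 'title="')[1]) -split '"')[0]   # -> Some Example Title

The full script: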


Code:
# Windows PowerShell 5.1 needs System.Web loaded for HtmlDecode; PowerShell 7+ has HttpUtility available already
if ($PSVersionTable.PSVersion.Major -lt 6) { Add-Type -AssemblyName System.Web }

$out_file = "$home\pornkind_export.csv"
Write-Host "out_file ->" $out_file
$data_list = @()

# grabs the text right after the first match of $from up to the next $till (both patterns are treated as regex by -split)
function split_between($text,$from,$till) {
   return ((($text -split $from )[1]) -split $till)[0]
}
#go through pages
1..63 | % {
    $page = $_
    Write-Host "check page" $page
    $link_page = "https://pornkind.net/page/$page/"
    $a = Invoke-RestMethod $link_page

    #grab only the "latest" part
    $data = $a -split "video-loop" | Select-Object -Last 1

    #grab all video blocks on the page and loop over each one
    $list = $data -split "video-block thumbs-rotation" | Select-Object -Skip 1
    $list | % {
        $line = $_
        $link_pornkind = split_between  $line 'href="' '"'
        $thump=   split_between  $line 'data-src="'  '"'
        $duration=   split_between  $line '"duration">'  '<'
        $title_short=  split_between  $line 'title="'  '"'
        $title_short= [System.Web.HttpUtility]::HtmlDecode($title_short)
  
        #load video page and get more details
        $a2 = Invoke-RestMethod $link_pornkind
        $link_stream =      split_between  $a2 'itemprop="contentURL" content="'  '"'
        $description =     split_between  $a2 '"description": "'  '"'
        $tag_list = @()
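        # every <meta property="article:tag"> on the video page becomes one entry in $tag_list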
        $a2 -split '<meta property="article:tag" content="' | Select-Object -Skip 1 | % {
            $tag_list += ($_ -split '"')[0]
        }
        $studio =   split_between  $a2 '<meta property="article:section" content="'  '"'
        $title_full=     split_between  $a2 '<h1>'  '<'
        $title_full= [System.Web.HttpUtility]::HtmlDecode($title_full)

        $data_map = New-Object System.Collections.Specialized.OrderedDictionary
        $data_map.Add("title" , $title_full)
        $data_map.Add("link_pornkind" , $link_pornkind)
        $data_map.Add("link_stream", $link_stream)
        $data_map.Add("duration" , $duration)
        $data_map.Add("studio" , $studio)
        $data_map.Add("tags" , ($tag_list -join ","))
        $data_map.Add("description" , $description)
        $data_map.Add("thump" , $thump)
        $data_list +=  New-Object PSObject -Property  $data_map
    }
}
$data_list  |  Export-Csv -Path $out_file -Delimiter ";"  -NoTypeInformation
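
If you want a quick look at the result, the CSV reads back in the same session (same delimiter as the export):

Code:
Import-Csv -Path $out_file -Delimiter ";" | Select-Object -First 3 | Format-Table title, duration, studio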
 
6 comments
I think the DOM changes more often than that. But the word stable* is maybe not the best choice, and the focus is on "quick and dirty".

*I use "stable" because I need no null/None checks and don't have to figure out whether the data comes via a src/href/text/innerHTML attribute/property.
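
Quick demo of the "no null checks" point with the split_between function from the script, on a made-up snippet where the marker doesn't exist at all:

Code:
function split_between($text,$from,$till) {
   return ((($text -split $from )[1]) -split $till)[0]
}
# no href in the input: [1] is $null and the second -split turns that into an empty string
split_between '<p>no link in here</p>' 'href="' '"'   # -> "" instead of an exception, the loop keeps running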
 
You have got way too much time on your hands :) haha

Thank you for the attention and promotion though. Much appreciated.
 