darkestblue
Active Member
I was bored, so I wrote a quick-and-dirty script for data scraping. As an example source I used the porn streaming site
pornkind.net
The script walks through all listing pages and posts and creates a CSV file with the following columns:
title | link_pornkind | link_stream | duration | studio | tags | description | thump |
The script doesn't use higher-level functions such as HTML parsing or element searches with XPath, only string splits, for fast and "stable" results.
Why PowerShell? Because I'm bored and I can use any tool, not only the standard ones.
The full pornkind_export.csv is downloadable at -> https://anonymousfiles.io/XCVghyEh/
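To illustrate the string-split approach, here is a minimal sketch that runs a variant of split_between against a made-up HTML fragment. The sample markup and the example.com URL are assumptions for illustration, not real output from the site. The [regex]::Escape calls are an addition so the markers are matched literally, since -split treats its pattern as a regular expression.

```powershell
# Variant of split_between: returns the text between the first $from marker
# and the next $till marker. [regex]::Escape is an addition (assumption) so
# that both markers are treated as literal text rather than regex patterns.
function split_between($text, $from, $till) {
    $after = ($text -split ([regex]::Escape($from)))[1]
    return ($after -split ([regex]::Escape($till)))[0]
}

# Made-up fragment, for illustration only
$sample = '<a href="https://example.com/video-1/" title="Tom &amp; Jerry"><span class="duration">12:34</span></a>'

$link     = split_between $sample 'href="' '"'
$duration = split_between $sample '"duration">' '<'
# WebUtility.HtmlDecode needs no extra assembly to be loaded
$title    = [System.Net.WebUtility]::HtmlDecode((split_between $sample 'title="' '"'))

Write-Host $link
Write-Host $duration
Write-Host $title
```

The trade-off of this approach is that it depends on the exact marker strings. If the site changes an attribute name or its order, the split silently returns an empty value instead of raising an error.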
Code:
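# System.Web is not loaded by default in Windows PowerShell; it is needed for [System.Web.HttpUtility]
Add-Type -AssemblyName System.Web
$out_file = "$home\pornkind_export.csv"
Write-Host "out_file ->" $out_file
$data_list = @()
# return the text between the first $from marker and the following $till marker
# (-split takes a regex, so the markers are escaped to match literally)
function split_between($text,$from,$till) {
    return ((($text -split ([regex]::Escape($from)))[1]) -split ([regex]::Escape($till)))[0]
}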
#go through pages
1..63 | % {
    $page = $_
    Write-Host "check page" $page
    $link_page = "https://pornkind.net/page/$page/"
    $a = Invoke-RestMethod $link_page
    #grab only the "latest" part
    $data = $a -split "video-loop" | Select-Object -Last 1
    #grab all videos per page and loop over them
    $list = $data -split "video-block thumbs-rotation" | Select-Object -Skip 1
    $list | % {
        $line = $_
        $link_pornkind = split_between $line 'href="' '"'
        $thump = split_between $line 'data-src="' '"'
        $duration = split_between $line '"duration">' '<'
        $title_short = split_between $line 'title="' '"'
        $title_short = [System.Web.HttpUtility]::HtmlDecode($title_short)
        #load video page and get more details
        $a2 = Invoke-RestMethod $link_pornkind
        $link_stream = split_between $a2 'itemprop="contentURL" content="' '"'
        $description = split_between $a2 '"description": "' '"'
        #collect all article:tag meta entries
        $tag_list = @()
        $a2 -split '<meta property="article:tag" content="' | Select-Object -Skip 1 | % {
            $tag_list += ($_ -split '"')[0]
        }
        $studio = split_between $a2 '<meta property="article:section" content="' '"'
        $title_full = split_between $a2 '<h1>' '<'
        $title_full = [System.Web.HttpUtility]::HtmlDecode($title_full)
        #ordered dictionary keeps the column order in the csv
        $data_map = New-Object System.Collections.Specialized.OrderedDictionary
        $data_map.Add("title" , $title_full)
        $data_map.Add("link_pornkind" , $link_pornkind)
        $data_map.Add("link_stream", $link_stream)
        $data_map.Add("duration" , $duration)
        $data_map.Add("studio" , $studio)
        $data_map.Add("tags" , ($tag_list -join ","))
        $data_map.Add("description" , $description)
        $data_map.Add("thump" , $thump)
        $data_list += New-Object PSObject -Property $data_map
    }
}
$data_list | Export-Csv -Path $out_file -Delimiter ";" -NoTypeInformation
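Since the export uses ";" as delimiter and joins the tags with ",", reading the file back needs the same delimiter. A small sketch of that, using a temporary file with two invented rows and only a subset of the columns instead of the real export (the file name export_sample.csv and all row values are made up for illustration):

```powershell
# Invented rows with a subset of the export's columns (illustration only)
$rows = @(
    [pscustomobject]@{ title = 'Sample A'; duration = '10:00'; studio = 'StudioX'; tags = 'tag1,tag2' }
    [pscustomobject]@{ title = 'Sample B'; duration = '05:30'; studio = 'StudioY'; tags = 'tag2,tag3' }
)
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'export_sample.csv'
$rows | Export-Csv -Path $tmp -Delimiter ';' -NoTypeInformation

# Read the file back with the same delimiter that was used for the export
$data = @(Import-Csv -Path $tmp -Delimiter ';')

# Count entries per studio
$data | Group-Object studio | Select-Object Name, Count

# Keep only rows whose comma-joined tags column contains a given tag
$tagged = @($data | Where-Object { ($_.tags -split ',') -contains 'tag2' })
Write-Host "rows tagged tag2:" $tagged.Count
```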