1. kingmaster56

    kingmaster56 New Member

    Mar 21, 2018
    2
    I want to scraping the id on a source code of a page with python but not all link just code ( sorry for my english i use google translate ) thank you so much
    [​IMG]
     
  2. Hyperz

    Hyperz Well-Known Member Respected

    Feb 8, 2009
    2,229
    There are 2 ways to go about it, parsing the HTML DOM, grabbing the target element and running a regular expression on the href attribute. This is the preferred way but requires a HTML parser like lxml:

    Code:
    import requests
    import re
    from lxml import html
    
    url = 'http://your-url-here/'
    resp = requests.get(url)
    doc = html.fromstring(resp.text)
    href = doc.xpath('string((//span[@class = "dato"]/a[contains(@href, "imdb.com/title/")])[1]/@href)')
    regex = re.search(r'imdb\.com/title/(?P<id>tt\d+)', href)
    imdb_id = regex.group('id') if regex is not None else 'NOT_FOUND'
    
    print(imdb_id)
    The (dirty) alternative is just running a regular expression on the entire HTML string or (even dirtier) using substring operations. For regex the above would become something like:

    Code:
    import requests
    import re
    
    url = 'http://your-url-here/'
    resp = requests.get(url)
    regex = re.search(r'imdb\.com/title/(?P<id>tt\d+)', resp.text, re.M)
    imdb_id = regex.group('id') if regex is not None else 'NOT_FOUND'
    
    print(imdb_id)
    Both examples use the "requests" package for the HTTP stuff.
     
  3. kingmaster56

    kingmaster56 New Member

    Mar 21, 2018
    2
    Hyperz
    Yess!!! your are great !!thanks you so much !! ^^
     

Share This Page