Scrape an ID from page source code (French)

Status
Not open for further replies.

kingmaster56

I want to scrape the ID from a page's source code with Python: not the whole link, just the ID code. (Sorry for my English, I'm using Google Translate.) Thank you so much.
[Attached image: take.png]
 
2 comments
There are two ways to go about it. The first is to parse the HTML into a DOM, grab the target element, and run a regular expression on its href attribute. This is the preferred way, but it requires an HTML parser such as lxml:

Code:
import requests
import re
from lxml import html

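url = 'http://your-url-here/'
resp = requests.get(url)
doc = html.fromstring(resp.text)
# Take the href of the first IMDb title link inside <span class="dato">;
# string() yields '' when nothing matches.
href = doc.xpath('string((//span[@class = "dato"]/a[contains(@href, "imdb.com/title/")])[1]/@href)')
# Pull the "tt" + digits ID out of the link.
match = re.search(r'imdb\.com/title/(?P<id>tt\d+)', href)
imdb_id = match.group('id') if match is not None else 'NOT_FOUND'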

print(imdb_id)
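
If installing lxml is not an option, roughly the same idea (parse the markup, collect the href, then run the regex on it) can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the XPath approach above, and it is looser: it accepts a link anywhere inside a span whose class list contains "dato" rather than only a direct child. The names ImdbLinkParser and extract_imdb_id, the sample markup, and the sample ID are assumptions for illustration only.

Code:
```python
import re
from html.parser import HTMLParser

IMDB_ID_RE = re.compile(r'imdb\.com/title/(?P<id>tt\d+)')


class ImdbLinkParser(HTMLParser):
    """Collects the href of every <a> found inside a <span class="dato">."""

    def __init__(self):
        super().__init__()
        self.span_stack = []  # one bool per open <span>: does it have class "dato"?
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'span':
            classes = (attrs.get('class') or '').split()
            self.span_stack.append('dato' in classes)
        elif tag == 'a' and any(self.span_stack):
            href = attrs.get('href')
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == 'span' and self.span_stack:
            self.span_stack.pop()


def extract_imdb_id(page_html):
    """Return the first IMDb ID linked from a "dato" span, or 'NOT_FOUND'."""
    parser = ImdbLinkParser()
    parser.feed(page_html)
    for href in parser.hrefs:
        match = IMDB_ID_RE.search(href)
        if match:
            return match.group('id')
    return 'NOT_FOUND'


# Hypothetical sample markup; the real page structure may differ.
sample_html = '''
<div>
  <a href="https://www.imdb.com/title/tt9999999/">unrelated link</a>
  <span class="dato"><a href="https://www.imdb.com/title/tt0111161/">IMDb</a></span>
</div>
'''
print(extract_imdb_id(sample_html))
```

With the sample above, the link outside the span is skipped and only the one inside it is considered.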

The (dirty) alternative is to run a regular expression on the entire HTML string or, even dirtier, to use substring operations. With a regex, the above becomes something like:

Code:
import requests
import re

url = 'http://your-url-here/'
resp = requests.get(url)
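# Search the raw HTML directly; the first IMDb title link anywhere on the page wins.
# (re.M is not needed: the pattern has no ^ or $ anchors.)
match = re.search(r'imdb\.com/title/(?P<id>tt\d+)', resp.text)
imdb_id = match.group('id') if match is not None else 'NOT_FOUND'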

print(imdb_id)
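
For completeness, the "even dirtier" substring route might look something like the sketch below. It only inspects the first occurrence of the marker text and gives up if that one is not followed by "tt" and digits, so it is more fragile than the regex. The function name imdb_id_by_substring and the sample string are made up for illustration.

Code:
```python
def imdb_id_by_substring(page_html):
    """Find 'imdb.com/title/' and read the 'tt' + digits that follow it."""
    marker = 'imdb.com/title/'
    start = page_html.find(marker)
    if start == -1:
        return 'NOT_FOUND'
    start += len(marker)
    # The ID must begin with the literal prefix "tt".
    if not page_html.startswith('tt', start):
        return 'NOT_FOUND'
    end = start + 2
    # Consume the run of digits after "tt".
    while end < len(page_html) and page_html[end].isdigit():
        end += 1
    if end == start + 2:  # "tt" with no digits after it
        return 'NOT_FOUND'
    return page_html[start:end]


# Hypothetical sample input; in practice this would be resp.text.
sample = '<a href="https://www.imdb.com/title/tt0111161/">IMDb</a>'
print(imdb_id_by_substring(sample))
```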

Both examples use the "requests" package to fetch the page over HTTP.