French scrape code ( id )

kingmaster56 · Mar 22, 2018

I want to scrape just the ID from the source code of a page with Python: not the whole link, only the code. (Sorry for my English, I'm using Google Translate.) Thank you so much!

Hyperz · Mar 22, 2018

There are two ways to go about it. The first is to parse the HTML DOM, grab the target element and run a regular expression on its href attribute. This is the preferred way, but it requires an HTML parser like lxml:

Code:

import requests
import re
from lxml import html

url = 'http://your-url-here/'
resp = requests.get(url)
doc = html.fromstring(resp.text)
# First IMDb title link inside <span class="dato">; string() yields '' if there is none
href = doc.xpath('string((//span[@class = "dato"]/a[contains(@href, "imdb.com/title/")])[1]/@href)')
# Capture only the ID (e.g. "tt" followed by digits), not the whole link
regex = re.search(r'imdb\.com/title/(?P<id>tt\d+)', href)
imdb_id = regex.group('id') if regex is not None else 'NOT_FOUND'

print(imdb_id)
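
If installing lxml is not an option, roughly the same DOM-based idea can be sketched with the standard library's html.parser instead. This is not the lxml code above, only an approximation of it: the class name ImdbLinkParser, the function find_imdb_id and the sample HTML are made up for illustration, and it assumes the link sits inside a span whose class list contains "dato".

```python
import re
from html.parser import HTMLParser

IMDB_ID = re.compile(r'imdb\.com/title/(tt\d+)')


class ImdbLinkParser(HTMLParser):
    """Hypothetical helper: keeps the first IMDb ID linked inside <span class="dato">."""

    def __init__(self):
        super().__init__()
        self.span_depth = 0   # > 0 while we are inside a <span class="dato">
        self.imdb_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'span':
            # Count nested spans too, so the matching </span> is tracked correctly
            if self.span_depth or 'dato' in (attrs.get('class') or '').split():
                self.span_depth += 1
        elif tag == 'a' and self.span_depth and self.imdb_id is None:
            match = IMDB_ID.search(attrs.get('href') or '')
            if match:
                self.imdb_id = match.group(1)

    def handle_endtag(self, tag):
        if tag == 'span' and self.span_depth:
            self.span_depth -= 1


def find_imdb_id(page_html):
    parser = ImdbLinkParser()
    parser.feed(page_html)
    return parser.imdb_id or 'NOT_FOUND'


# Made-up sample: the first link is outside the span and is ignored
sample = (
    '<p><a href="http://www.imdb.com/title/tt0000001/">other</a></p>'
    '<span class="dato"><a href="http://www.imdb.com/title/tt0111161/">IMDb</a></span>'
)
print(find_imdb_id(sample))  # tt0111161
```

In real use you would pass resp.text from requests to find_imdb_id instead of the sample string.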

The (dirty) alternative is to run a regular expression on the entire HTML string or (even dirtier) to use substring operations. With a regex, the above would become something like:

Code:

import requests
import re

url = 'http://your-url-here/'
resp = requests.get(url)
# Search the raw HTML for the first IMDb title link (no re.M needed: the pattern has no ^ or $)
regex = re.search(r'imdb\.com/title/(?P<id>tt\d+)', resp.text)
imdb_id = regex.group('id') if regex is not None else 'NOT_FOUND'

print(imdb_id)

Both examples use the "requests" package to fetch the page over HTTP.
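
For completeness, the "even dirtier" substring approach could look roughly like the sketch below. It is only an illustration: the function name imdb_id_by_substring is made up, and it assumes the ID always follows the text "imdb.com/title/" and consists of "tt" plus digits.

```python
def imdb_id_by_substring(page_html):
    """Hypothetical helper: find the first 'tt<digits>' after 'imdb.com/title/'."""
    marker = 'imdb.com/title/'
    pos = page_html.find(marker)
    while pos != -1:
        start = pos + len(marker)
        if page_html.startswith('tt', start):
            end = start + 2
            # Walk forward over the digits that follow "tt"
            while end < len(page_html) and page_html[end] in '0123456789':
                end += 1
            if end > start + 2:
                return page_html[start:end]
        # This occurrence had no valid ID; try the next one
        pos = page_html.find(marker, start)
    return 'NOT_FOUND'


print(imdb_id_by_substring('<a href="http://www.imdb.com/title/tt0111161/">IMDb</a>'))  # tt0111161
```

It works, but it is longer and easier to get wrong than the regex, which is why the regex is the better of the two quick options.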

kingmaster56 · Mar 22, 2018

Hyperz
Yes!!! You are great!! Thank you so much!! ^^
