Webscraping: verschil tussen versies

Versie van 3 aug 2019 15:23

Hoera! Webscraping! Dit is echt heel leuk!

Context

Wat ik gebruik:

Webclient: requests
HTML-parser: lxml [1]
Bewerken van de parse-tree: Beautiful Soup

Casussen

WooCrack Kopiëeren: Hoe kun je WooCrack kopiëren, inclusief de downloads? Dit geldt voor de situatie dat je over inlogcodes beschikt
Price bot: Hoe kan ik met een script voor bepaalde producten, de prijzen van de concurrent zien? In dit geval zou ik vantevoren de URL's van de websites van concurrenten verzamelen
Alle koolborstel-info: Kan ik voor een klant die in koolborstels handelt, alle koolborstels te wereld downloaden, inclusief alle gerelateerde data?

Vraagstukken

Hoe kun je sites scrapen waar je eerst moet inloggen? Paar extra stappen die je met de webclient doorloopt? → Zie voorbeeld requests.
Hoe download je bestanden, zoals van WooCrack?

Webclients

Om te interacteren met een webserver, heb je een webclient nodig. De gebruikelijke pakketten in Python hiervoor:

urllib
urllib2
urllib3
requests - Op zichzelfstaand pakket. Niet onderdeel van een van de andere pakketten! - Vermoedelijk het beste pakket [2]

GET, POST, PUT

De belangrijkste HTTP requests:

GET: Request a resource from a server. Argument: An URL
POST: Sent data to a server to create/update a resource. Als je Terug klikt binnen een browser, en deze waarschuwt je, dat je iets opnieuw submit, dan betreft het een POST-request
PUT: Broertje van POST, maar dan idempotent: Als je hetzelfde PUT-commando meerdere keren uitvoert, verandert het resultaat niet. Als je een POST-commando herhaalt, creëer je additionele resources/updates.

GET en POST zijn de twee meestvoorkomende HTTP requests. PUT is een stuk zeldzamer, net als de overige requests (die hier niet behandeld worden).

Requests - Inloggen op een afgeschermde pagina

Dat gaat met requests verbazend simpel. Sterker nog: Het voorbeeld op de home page demonstreet dit al:

>>> r = requests.get('http://example.strompf.com', auth=('xxx','yyy'))
>>> r.status_code
200
>>> r.headers
{'Content-language': 'en', 'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'Vary': 'Accept-Encoding,Cookie', 'X-Powered-By': 'PHP/5.5.9-1ubuntu4.27', 'Date': 'Fri, 02 Aug 2019 12:14:12 GMT', 'Cache-Control': 'private, must-revalidate, max-age=0', 'Server': 'Apache/2.4.7 (Ubuntu)', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=5, max=49', 'Connection': 'Keep-Alive', 'Content-Length': '32237', 'Last-Modified': 'Fri, 02 Aug 2019 10:03:24 GMT'}
>>> r.headers['content-type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'
>>> r.text
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html lang="en" dir="ltr">\n<head>\n<title>Main Page - Example</title>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n<meta name="generator" content="MediaWiki 1.16.4" />\n<link rel="alternate" type="application/x-wiki" title="Edit" href="/index.php?title=Main_Page&action=edit" />\n<link rel="edit" title="Edit" href="/index.php?title=Main_Page&action=edit" />\n<link rel="shortcut icon" href="/favicon.ico" />\n<link rel="search" type="application/opensearchdescription+xml" href="/opensearch_desc.php" title="Example (en)" />\n<link rel="alternate" type="a

Ter verificatie

#! /usr/bin/python3
#
# Experiments with requests
###################################################################
#
#
import requests

print(">>> Login zonder credentials...")
r = requests.get('http://example.strompf.com')
print(r.status_code)

print(">>> Login met correcte credentials...")
r = requests.get('http://example.strompf.com', auth=('xxx','correcte wachtwoord'))
print(r.status_code)

print(">>> Login met incorrecte credentials...")
r = requests.get('http://example.strompf.com', auth=('xxx','Verkeerde wachtwoord'))
print(r.status_code)

Uitvoer:

>>> Login zonder credentials...
401
>>> Login met correcte credentials...
200
>>> Login met incorrecte credentials...
401

P.s.: Handjevol HTTP Status-codes:

200: OK
301: Moved Permanently
307: Temporary Redirected
400: Bad Request
401: Unauthorized
403: Forbidden - Ik geloof dat je dit krijgt als je Google Search probeert te scrapen
404: Not found
504: Gateway Timeout - Krijg ik regelmatig op https://couchsurfing.com

Voorbeeld NewEgg - Scraping

[3]

Doel

Lijst met

Productitels
SKU's (indien beschikbaar)
EAN-codes (indien beschikbaar)
Prijzen.

Script

#! /usr/bin/python3
#
# Newegg webcrawling-example - Data Science Dojo
###################################################################
#
# Source: https://www.youtube.com/watch?v=XQgXKtPSzUI

###################################################################
# Libraries
###################################################################
#
# Beautiful Soup
################
#
# * For processing websites; the actual crawling
# * Only "BeatitfulSoup" is imported from bs4
# * "Soup" functions like an alias
#
from bs4 import BeautifulSoup as soup

# Webclient
################
#
# * From urllib, only urlopen from request is needed
# * "uReq" works like an alias
#
from urllib.request import urlopen as uReq


###################################################################
# Fetch a webpage
###################################################################
#
# The page we want to scrape
############################
#
my_url = 'https://www.newegg.com/global/nl-en/p/pl?d=graphics+card'

# Download the page to object p
#########################################
#
p = uReq(my_url)

# What kind of object is this?
##############################
#
# print(type(p))
#
# Reply:
#
# <class 'http.client.HTTPResponse'>

# Welke methodes heeft dit object?
##################################
#
dir(p)
#
# Reply:
#
# ['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', 
# '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', 
# '__getattribute__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', 
# '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', 
# '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', 
# '__subclasshook__', '_abc_cache', '_abc_negative_cache', 
# '_abc_negative_cache_version', '_abc_registry', '_checkClosed', 
# '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', 
# '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', 
# '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', 
# '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', 
# '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 
# 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 
# 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 
# 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 
# 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 
# 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 
# 'writelines']

# Stop de eigenlijke content in een variable
############################################
#
p_html = p.read()

# Wat type variable is dit geworden? → byte
###########################################
#
type(p_html)
#
# Reply: 
#
# <class 'bytes'>
# Reason that this is 'bytes' and not e.g., 'text': A page can contain mixes
# text/binary content 

# Sluit de connectie
####################
#
# Why HTML is a stateless protocol, isn't it? Whatever
#
p.close()


###################################################################
# Process the webpage
###################################################################
#
# Parse this object as an html-object (and not like e.g., an XML-
# or FTP-object)
#
p_soup = soup(p_html, "html.parser")

# Wat voor klasse is p_soup?
############################
#
type(p_soup)
#
# → <class 'bs4.BeautifulSoup'>

# Wat voor methodes heeft p_soup?
#################################
#
dir(p_soup)
#
# Reply:
#
# ['ASCII_SPACES', 'DEFAULT_BUILDER_FEATURES', 'NO_PARSER_SPECIFIED_WARNING', 
# 'ROOT_TAG_NAME', '__bool__', '__call__', '__class__', '__contains__', '__copy__', 
# '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', 
# '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', 
# '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', 
# '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', 
# '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', 
# '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', 
# '_check_markup_is_url', '_feed', '_find_all', '_find_one', '_is_xml', 
# '_lastRecursiveChild', '_last_descendant', '_linkage_fixer', 
# '_most_recent_element', '_namespaces', '_popToTag', '_should_pretty_print', 
# 'append', 'attrs', 'builder', 'can_be_empty_element', 'cdata_list_attributes', 
# 'childGenerator', 'children', 'clear', 'contains_replacement_characters', 
# 'contents', 'currentTag', 'current_data', 'declared_html_encoding', 'decode', 
# 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents', 
# 'endData', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents', 
# 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 
# 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 
# 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 
# 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 
# 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 
# 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 
# 'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText', 
# 'get_attribute_list', 'get_text', 'handle_data', 'handle_endtag', 'handle_starttag', 
# 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 
# 'isSelfClosing', 'is_empty_element', 'is_xml', 'known_xml', 'markup', 'name', 
# 'namespace', 'new_string', 'new_tag', 'next', 'nextGenerator', 'nextSibling', 
# 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 
# 'next_siblings', 'object_was_parsed', 'original_encoding', 'parent', 
# 'parentGenerator', 'parents', 'parse_only', 'parserClass', 'parser_class', 
# 'popTag', 'prefix', 'preserve_whitespace_tag_stack', 'preserve_whitespace_tags', 
# 'prettify', 'previous', 'previousGenerator', 'previousSibling', 
# 'previousSiblingGenerator', 'previous_element', 'previous_elements', 
# 'previous_sibling', 'previous_siblings', 'pushTag', 'recursiveChildGenerator', 
# 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 
# 'replace_with_children', 'reset', 'select', 'select_one', 'setup', 'smooth', 
#'string', 'strings', 'stripped_strings', 'tagStack', 'text', 'unwrap', 'wrap']



# Try out some some stuff...
############################
#
p_soup.h1			# → <h1 class="page-title-text">"graphics card"</h1>
p_soup.p   			# First p-tag
p_soup.meta 		# First meta tag
p_soup.body			# Gewoon, de body van de pagina :)
p_soup.body.span	# First span-tag


# Create a resultset with all "item-container" div classes
##########################################################
#
# * This is actually plain HTML code
# * "div" has one argument-value-pair (or whatever its called). 
#   That's included here as a dictionary: {"argument":"value"}
#
cs = p_soup.findAll("div",{"class":"item-container"})

type(cs)	# → <class 'bs4.element.ResultSet'>
len(cs)		# → Aantal elementen = 40

# print(cs[5]) # → HTML-code


##############################################################
# Have a closer look at cs[10]
##############################################################
#
# * This 10th item is a good example, as it has a price (not all items have prices)
#
c=cs[10]

c2=c.findAll("a",{"class":"item-brand"})	# It's a list

print(c2)
#
# [<a class="item-brand" href="https://www.newegg.com/global/nl-en/GIGABYTE/BrandStore/ID-1314">
# <img alt="GIGABYTE" src="//c1.neweggimages.com/Brandimage_70x28//Brand1314.gif" title="GIGABYTE"/>
# </a>]

c3=c2[0]

print(c3)
#
# <a class="item-brand" href="https://www.newegg.com/global/nl-en/GIGABYTE/BrandStore/ID-1314">
# <img alt="GIGABYTE" src="//c1.neweggimages.com/Brandimage_70x28//Brand1314.gif" title="GIGABYTE"/>
# </a>

print(type(c3))
#
# <class 'bs4.element.Tag'>

# → And now I'm lost. How do I get the "title" attribute???


##############################################################
# Loop through all containers
##############################################################
#
# I suspect there are more eligent ways of looping through a
# hierarchical data set - PHP can already do that, so surely
# Python can do that even better
#
# print("********************** loop")

# i=0
# while i < len(cs):

# 	Print("Item "+i)

# 	c=cs[i]

# 	brand_container=c.findAll("a",{"class":"item-brand"})

# 	print("Type brand_container: ".type(brand_container))
# 	# print("Length brand_container: "+len(brand_container))
# 	print("Title: "+c.img["title"])	# Complete product title
# 	i += 1

Beautiful Soup - Hun eigen tutorial

De eigen documentatie van Beautiful Soup, op https://www.crummy.com/software/BeautifulSoup/bs4/doc/, vind ik geweldig.
Beautiful Soup fietst een complex HTML-document om naar een complexe hiërarchische boom van Python-objecten.

Tag

Zoiets als dit:

>>> soup.b
<b>The Dormouse's story</b>
>>> tag=soup.b
>>> tag
<b>The Dormouse's story</b>
>>> type(tag)
<class 'bs4.element.Tag'>

Tags hebben een hoop methodes en attributen.

Naam van een tag:

>>> tag.name
'b'

Etc. Zie de handleiding

NavigableString

Zie ook

Bronnen

https://www.youtube.com/watch?v=XQgXKtPSzUI - Windows & beginners, maar eigenlijk best wel heel erg goed
https://zach-adams.com/2015/04/python-scraping-wordpress/
https://www.youtube.com/watch?v=ng2o98k983k - Python Tutorial: Web Scraping with BeautifulSoup and Requests
https://www.youtube.com/watch?v=tb8gHvYlCFs - Computerbrowsing

Post (ipv. get)

Request-library

https://2.python-requests.org//en/latest/index.html

@@ Regel 399: / Regel 399: @@
 == Zie ook ==
+* [[Beautiful Soup]]
 * [[HTML-filtering in Python]]
 * [[Pip (Python) | Pip]]