Web scraping: difference between revisions

Current revision as of 19 Jan 2021, 09:25

Hooray! Web scraping! This is really fun!

Context

What I use:

Web client: requests
HTML parser: lxml [1]
Editing the parse tree: Beautiful Soup

Cases

Copying WooCrack: How can you copy WooCrack, including the downloads? This applies to the situation where you have login credentials
Price bot: How can I use a script to see the competition's prices for certain products? In this case I would collect the URLs of the competitors' websites in advance
All product information: Can I, for a customer who trades in certain products, download all of those products in the world, including all related data?
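For the price bot case, the comparison step could look roughly like the sketch below. It assumes that prices arrive as strings in Dutch notation (e.g. "€ 1.234,56"). The function names parse_price and cheapest, and the shop names in the sample data, are hypothetical.

```python
from decimal import Decimal
import re


def parse_price(text):
    """Convert a price string such as '€ 1.234,56' to a Decimal.

    Assumption: '.' is the thousands separator and ',' the decimal separator.
    """
    digits = re.sub(r"[^\d.,]", "", text)            # drop currency sign and spaces
    normalized = digits.replace(".", "").replace(",", ".")
    return Decimal(normalized)


def cheapest(prices_by_shop):
    """Return (shop, price) for the lowest price in a {shop: price_text} dict."""
    parsed = {shop: parse_price(text) for shop, text in prices_by_shop.items()}
    shop = min(parsed, key=parsed.get)
    return shop, parsed[shop]


if __name__ == "__main__":
    # Hypothetical sample data; in practice these strings would be scraped
    sample = {"shop-a": "€ 1.234,56", "shop-b": "€ 1.199,00", "shop-c": "€ 1.250,00"}
    print(cheapest(sample))
```

Using Decimal rather than float avoids rounding surprises when prices are compared or subtracted.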

Questions

How can you scrape sites where you first have to log in? A few extra steps that you go through with the web client? → See the requests example.
How do you download files, such as those from WooCrack?
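For the second question, downloading a file mostly comes down to streaming the response body to disk. The sketch below uses urllib from the standard library so that it runs without extra packages. The function name download and the fallback file name are assumptions, and it does not cover the login step that a protected site would need first.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def download(url, target_dir=".", timeout=30):
    """Stream the resource at `url` to a file in `target_dir`; return its path."""
    # Derive the local file name from the last path segment of the URL
    name = Path(unquote(urlparse(url).path)).name or "download.bin"
    target = Path(target_dir) / name
    # Copy in chunks, so that large files never have to fit in memory
    with urllib.request.urlopen(url, timeout=timeout) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

A call such as download("https://example.com/files/plugin.zip", "/tmp") would then write /tmp/plugin.zip (the URL is a placeholder).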

Web clients

To interact with a web server, you need a web client. The usual packages for this in Python:

urllib
urllib2 (Python 2 only; in Python 3 its functionality is part of urllib)
urllib3
requests - Standalone package. Not part of any of the other packages! - Probably the best package [2]. See Requests (Python).
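As an illustration of what such a web client does, here is a minimal sketch with the standard-library urllib. It builds the search URL used in the NewEgg example further down and wraps it in a request object. The names build_request and fetch are hypothetical, and the user agent string is just a placeholder value.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://www.newegg.com/global/nl-en/p/pl"


def build_request(query, user_agent="Mozilla/5.0"):
    """Build (but do not send) a GET request for a search query."""
    url = BASE_URL + "?" + urlencode({"d": query})   # spaces become '+'
    return Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=30):
    """Send the request and return (status code, raw body as bytes)."""
    with urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


if __name__ == "__main__":
    req = build_request("graphics card")
    print(req.get_method(), req.full_url)
```

With requests, roughly the same call would be requests.get(BASE_URL, params={"d": "graphics card"}), which is a large part of why that package feels more convenient.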

NewEgg example

This worked example is based on this tutorial.

Goal

A list of:

Product titles
SKUs (if available)
EAN codes (if available)
Prices.
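One possible way to hold such a list in Python is sketched below: a small record type with optional SKU and EAN fields, plus a helper that writes the records as CSV. The names Product and to_csv are assumptions for illustration and do not appear in the scripts that follow.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class Product:
    title: str
    price: str
    sku: Optional[str] = None   # not every shop exposes a SKU
    ean: Optional[str] = None   # same for EAN codes


def to_csv(products):
    """Serialize a list of Product records to CSV text (empty cell = unknown)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Product)])
    writer.writeheader()
    for product in products:
        writer.writerow(asdict(product))
    return buffer.getvalue()
```

Keeping the price as text at this stage postpones the question of currency and decimal notation until the data is actually compared.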

Original

This code follows the YouTube video most closely, including the use of urllib and aliases.

#! /usr/bin/python3
#
# Newegg webcrawling-example - Data Science Dojo
###################################################################
#
# Source: https://www.youtube.com/watch?v=XQgXKtPSzUI
#
# Goals
#######
#
# Create a list with the following info per product:
#
# * Brand
# * Title
# * Price
# * SKU (if available)
# * EAN-code (if available)
#
print ("\n\n\n")
print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print (">>> 100-Newegg.py")
print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print ("\n\n")


###################################################################
# Load libraries
###################################################################
#
# Beautiful Soup
################
#
# * For processing websites; the actual scraping
# * Only "BeautifulSoup" is imported from bs4
# * "soup" functions as an alias
#
from bs4 import BeautifulSoup as soup

# Webclient
################
#
# * From urllib, only urlopen from request is needed
# * "uReq" works like an alias
#
from urllib.request import urlopen as uReq


###################################################################
# Fetch a webpage
###################################################################
#
# The page we want to scrape
############################
#
my_url = 'https://www.newegg.com/global/nl-en/p/pl?d=graphics+card'

# Download the page to object p
#########################################
#
p = uReq(my_url)

# What kind of object is this?
##############################
#
# print(type(p))
#
# Reply:
#
# <class 'http.client.HTTPResponse'>

# Which methods does this object have?
######################################
#
dir(p)
#
# Reply:
#
# ['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', 
# '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', 
# '__getattribute__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', 
# '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', 
# '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', 
# '__subclasshook__', '_abc_cache', '_abc_negative_cache', 
# '_abc_negative_cache_version', '_abc_registry', '_checkClosed', 
# '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', 
# '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', 
# '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', 
# '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', 
# '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 
# 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 
# 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 
# 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 
# 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 
# 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 
# 'writelines']

# Put the actual content in a variable
######################################
#
p_html = p.read()

# What type of variable has this become? → bytes
################################################
#
type(p_html)
#
# Reply: 
#
# <class 'bytes'>
# Reason that this is 'bytes' and not 'str': read() returns the raw response
# body; turning it into text requires decoding with the page's character encoding

# Close the connection
######################
#
# Why? HTTP is a stateless protocol, isn't it? Still, the response object holds
# an open connection until it is closed
#
p.close()


###################################################################
# Process the webpage
###################################################################
#
# Parse this object as HTML (and not as e.g., XML)
#
p_soup = soup(p_html, "html.parser")

# What class is p_soup?
#######################
#
type(p_soup)
#
# → <class 'bs4.BeautifulSoup'>

# Which methods does p_soup have?
#################################
#
dir(p_soup)
#
# Reply:
#
# ['ASCII_SPACES', 'DEFAULT_BUILDER_FEATURES', 'NO_PARSER_SPECIFIED_WARNING', 
# 'ROOT_TAG_NAME', '__bool__', '__call__', '__class__', '__contains__', '__copy__', 
# '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', 
# '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', 
# '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', 
# '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', 
# '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', 
# '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', 
# '_check_markup_is_url', '_feed', '_find_all', '_find_one', '_is_xml', 
# '_lastRecursiveChild', '_last_descendant', '_linkage_fixer', 
# '_most_recent_element', '_namespaces', '_popToTag', '_should_pretty_print', 
# 'append', 'attrs', 'builder', 'can_be_empty_element', 'cdata_list_attributes', 
# 'childGenerator', 'children', 'clear', 'contains_replacement_characters', 
# 'contents', 'currentTag', 'current_data', 'declared_html_encoding', 'decode', 
# 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents', 
# 'endData', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents', 
# 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 
# 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 
# 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 
# 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 
# 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 
# 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 
# 'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText', 
# 'get_attribute_list', 'get_text', 'handle_data', 'handle_endtag', 'handle_starttag', 
# 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 
# 'isSelfClosing', 'is_empty_element', 'is_xml', 'known_xml', 'markup', 'name', 
# 'namespace', 'new_string', 'new_tag', 'next', 'nextGenerator', 'nextSibling', 
# 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 
# 'next_siblings', 'object_was_parsed', 'original_encoding', 'parent', 
# 'parentGenerator', 'parents', 'parse_only', 'parserClass', 'parser_class', 
# 'popTag', 'prefix', 'preserve_whitespace_tag_stack', 'preserve_whitespace_tags', 
# 'prettify', 'previous', 'previousGenerator', 'previousSibling', 
# 'previousSiblingGenerator', 'previous_element', 'previous_elements', 
# 'previous_sibling', 'previous_siblings', 'pushTag', 'recursiveChildGenerator', 
# 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 
# 'replace_with_children', 'reset', 'select', 'select_one', 'setup', 'smooth', 
# 'string', 'strings', 'stripped_strings', 'tagStack', 'text', 'unwrap', 'wrap']


# Try out some stuff...
############################
#
p_soup.h1			# → <h1 class="page-title-text">"graphics card"</h1>
p_soup.p   			# First p-tag
p_soup.meta 		# First meta tag
p_soup.body			# Simply the body of the page :)
p_soup.body.span	# First span-tag


##############################################################
# OK - Have a closer look at cs[10]
##############################################################
#
# cs = p_soup.findAll("div",{"class":"item-container"})

# type(cs)	# <class 'bs4.element.ResultSet'>
# len(cs)		# Number of elements = 40

# * Item cs[10] is a good example, as it has a price (not all items have prices)
#
# c=cs[10]
# type(c)   # bs4.element.Tag

# Fetch brand
############################################
#
# Like this:
#
# <div class="item-info"> » 
# <a 
#    class="item-brand" 
#    href="https://www.newegg.com/global/nl-en/GIGABYTE/BrandStore/ID-1314">
#    <img 
#       alt="GIGABYTE" 
#       src="//c1.neweggimages.com/Brandimage_70x28//Brand1314.gif" 
#       title="GIGABYTE"/>
# </a>
#
# c_brand = c.find(class_="item-brand").img['alt']
# print ("c10 - Brand: "+c_brand)

# # Product name
# ############################################
# #
# c_name = c.find(class_="item-title").text
# print("c10 - Name: "+c_name)

# # Fetch price
# ############################################
# #
# c_price = c.find(class_="price-current").strong.text
# print ("c10 - Price: "+c_price)


##############################################################
# Iterate over items in cs
##############################################################
#
# Create a resultset with all "item-container" div classes
##########################################################
#
# * This is actually plain HTML code
# * "div" has one attribute-value pair here (class="item-container").
#   That's passed as a dictionary: {"attribute":"value"}
#
print (">>> Create resultset cs...")

cs = p_soup.findAll("div",{"class":"item-container"})

type(cs)	# <class 'bs4.element.ResultSet'>
len(cs)		# Number of elements = 40
i=0

for c in cs:
	print("\n\n>>>>>>>>> Next element")
	i=i+1
	print(i)

	# Look up each element once; items may lack a brand logo or a price
	brand = c.find(class_="item-brand")
	if (brand is not None and brand.img is not None):
		c_brand = brand.img['alt']
		print ("Brand: "+c_brand)

	title = c.find(class_="item-title")
	if (title is not None):
		c_name = title.text
		print("Name: "+c_name)

	price = c.find(class_="price-current")
	if (price is not None and price.strong is not None):
		c_price = price.strong.text
		print ("Price: "+c_price)
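The script above leaves p_html as bytes and lets Beautiful Soup work out the encoding. If the text itself is needed, the bytes have to be decoded with the character set that the server announces in its Content-Type header. A minimal sketch of that step, with the hypothetical helper names charset_of and decode_body:

```python
from email.message import Message


def charset_of(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    header = Message()
    header["Content-Type"] = content_type
    return header.get_content_charset(default)


def decode_body(raw, content_type):
    """Decode raw response bytes into str, replacing undecodable bytes."""
    return raw.decode(charset_of(content_type), errors="replace")


if __name__ == "__main__":
    print(decode_body("café".encode("utf-8"), "text/html; charset=UTF-8"))
```

In the script above, the same information should be reachable through p.headers.get_content_charset(), since the headers object of an HTTPResponse is an email-style message.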

Refactored

With requests instead of urllib
Without aliases
A bit shorter.

#! /usr/bin/python3
#
# Newegg webcrawling-example - Refactored
###################################################################
#
print ("\n\n\n")
print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print (">>> 105-Newegg-refactored.py")
print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print ("\n\n")


###################################################################
# Load libraries
###################################################################
#
# Beautiful Soup
################
#
from bs4 import BeautifulSoup

# Webclient
################
#
import requests


###################################################################
# Fetch a webpage
###################################################################
#
# The page we want to scrape
############################
#
url = 'https://www.newegg.com/global/nl-en/p/pl?d=graphics+card'

# Download the page to text object p
#########################################
#
# * type(requests.get(url)): <class 'requests.models.Response'>
# * Attribute "text" returns the response body decoded to a string
#
p = requests.get(url).text


###################################################################
# Process the webpage
###################################################################
#
# Parse this object as HTML (and not as e.g., XML)
#
p_soup = BeautifulSoup(p, "html.parser")

# What class is p_soup?
#######################
#
type(p_soup)
#
# → <class 'bs4.BeautifulSoup'>

# Which methods does p_soup have?
#################################
#
dir(p_soup)
#
# Reply:
#
# ['ASCII_SPACES', 'DEFAULT_BUILDER_FEATURES', 'NO_PARSER_SPECIFIED_WARNING', 
# 'ROOT_TAG_NAME', '__bool__', '__call__', '__class__', '__contains__', '__copy__', 
# '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', 
# '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', 
# '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', 
# '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', 
# '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', 
# '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', 
# '_check_markup_is_url', '_feed', '_find_all', '_find_one', '_is_xml', 
# '_lastRecursiveChild', '_last_descendant', '_linkage_fixer', 
# '_most_recent_element', '_namespaces', '_popToTag', '_should_pretty_print', 
# 'append', 'attrs', 'builder', 'can_be_empty_element', 'cdata_list_attributes', 
# 'childGenerator', 'children', 'clear', 'contains_replacement_characters', 
# 'contents', 'currentTag', 'current_data', 'declared_html_encoding', 'decode', 
# 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents', 
# 'endData', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents', 
# 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 
# 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 
# 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 
# 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 
# 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 
# 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 
# 'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText', 
# 'get_attribute_list', 'get_text', 'handle_data', 'handle_endtag', 'handle_starttag', 
# 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 
# 'isSelfClosing', 'is_empty_element', 'is_xml', 'known_xml', 'markup', 'name', 
# 'namespace', 'new_string', 'new_tag', 'next', 'nextGenerator', 'nextSibling', 
# 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 
# 'next_siblings', 'object_was_parsed', 'original_encoding', 'parent', 
# 'parentGenerator', 'parents', 'parse_only', 'parserClass', 'parser_class', 
# 'popTag', 'prefix', 'preserve_whitespace_tag_stack', 'preserve_whitespace_tags', 
# 'prettify', 'previous', 'previousGenerator', 'previousSibling', 
# 'previousSiblingGenerator', 'previous_element', 'previous_elements', 
# 'previous_sibling', 'previous_siblings', 'pushTag', 'recursiveChildGenerator', 
# 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 
# 'replace_with_children', 'reset', 'select', 'select_one', 'setup', 'smooth', 
# 'string', 'strings', 'stripped_strings', 'tagStack', 'text', 'unwrap', 'wrap']


# Try out some stuff...
############################
#
p_soup.h1			# → <h1 class="page-title-text">"graphics card"</h1>
p_soup.p   			# First p-tag
p_soup.meta 		# First meta tag
p_soup.body			# Simply the body of the page :)
p_soup.body.span	# First span-tag


##############################################################
# OK - Have a closer look at cs[10]
##############################################################
#
# cs = p_soup.findAll("div",{"class":"item-container"})

# type(cs)	# <class 'bs4.element.ResultSet'>
# len(cs)		# Number of elements = 40

# * Item cs[10] is a good example, as it has a price (not all items have prices)
#
# c=cs[10]
# type(c)   # bs4.element.Tag

# Fetch brand
############################################
#
# Like this:
#
# <div class="item-info"> » 
# <a 
#    class="item-brand" 
#    href="https://www.newegg.com/global/nl-en/GIGABYTE/BrandStore/ID-1314">
#    <img 
#       alt="GIGABYTE" 
#       src="//c1.neweggimages.com/Brandimage_70x28//Brand1314.gif" 
#       title="GIGABYTE"/>
# </a>
#
# c_brand = c.find(class_="item-brand").img['alt']
# print ("c10 - Brand: "+c_brand)

# # Product name
# ############################################
# #
# c_name = c.find(class_="item-title").text
# print("c10 - Name: "+c_name)

# # Fetch price
# ############################################
# #
# c_price = c.find(class_="price-current").strong.text
# print ("c10 - Price: "+c_price)


##############################################################
# Iterate over items in cs
##############################################################
#
# Create a resultset with all "item-container" div classes
##########################################################
#
# * This is actually plain HTML code
# * "div" has one attribute-value pair here (class="item-container").
#   That's passed as a dictionary: {"attribute":"value"}
#
print (">>> Create resultset cs...")

cs = p_soup.findAll("div",{"class":"item-container"})

type(cs)	# <class 'bs4.element.ResultSet'>
len(cs)		# Number of elements = 40
i=0

for c in cs:
	print("\n\n>>>>>>>>> Next element")
	i=i+1
	print(i)

	# Look up each element once; items may lack a brand logo or a price
	brand = c.find(class_="item-brand")
	if (brand is not None and brand.img is not None):
		c_brand = brand.img['alt']
		print ("Brand: "+c_brand)

	title = c.find(class_="item-title")
	if (title is not None):
		c_name = title.text
		print("Name: "+c_name)

	price = c.find(class_="price-current")
	if (price is not None and price.strong is not None):
		c_price = price.strong.text
		print ("Price: "+c_price)
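To try the extraction logic without hitting the live site, the same lookups can be run against a static snippet. The markup below is a hypothetical, simplified imitation of the "item-container" blocks (only the brand link is modelled on the fragment quoted in the script), and extract_items is an illustrative name. The sketch collects the fields into dictionaries instead of printing them.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the "item-container" blocks
SAMPLE = """
<div class="item-container">
  <a class="item-brand" href="#"><img alt="GIGABYTE" title="GIGABYTE"/></a>
  <a class="item-title" href="#">Example graphics card</a>
  <ul><li class="price-current">€ <strong>499</strong><sup>,00</sup></li></ul>
</div>
<div class="item-container">
  <a class="item-title" href="#">Card without brand logo or price</a>
  <ul><li class="price-current"></li></ul>
</div>
"""


def extract_items(html):
    """Return one dict per item; fields missing on the page become None."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for c in soup.find_all("div", class_="item-container"):
        brand = c.find(class_="item-brand")
        title = c.find(class_="item-title")
        price = c.find(class_="price-current")
        has_logo = brand is not None and brand.img is not None
        has_price = price is not None and price.strong is not None
        items.append({
            "brand": brand.img.get("alt") if has_logo else None,
            "name": title.get_text(strip=True) if title is not None else None,
            "price": price.strong.get_text(strip=True) if has_price else None,
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE):
        print(item)
```

The second item shows why each lookup is checked for None: an empty price element would otherwise raise an AttributeError.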

Compact

#! /usr/bin/python3
#
# Newegg webcrawling-example - Compact
###################################################################
#
print ("\n\n\n")
print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print (">>> 107-Newegg-compact.py")
print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
print ("\n\n")


###################################################################
# Load libraries
###################################################################
#
from bs4 import BeautifulSoup
import requests


###################################################################
# Fetch webpage
###################################################################
#
url = 'https://www.newegg.com/global/nl-en/p/pl?d=graphics+card'
p_html = requests.get(url).text
p_soup = BeautifulSoup(p_html, "html.parser")


###################################################################
# Process webpage
###################################################################
#
cs = p_soup.findAll("div",{"class":"item-container"})

for i, c in enumerate(cs, start=1):
	print("")
	print(i)

	# Items may lack a brand logo or a price, so check before use
	brand = c.find(class_="item-brand")
	if (brand is not None and brand.img is not None):
		print ("Brand: "+brand.img['alt'])

	title = c.find(class_="item-title")
	if (title is not None):
		print("Name: "+title.text)

	price = c.find(class_="price-current")
	if (price is not None and price.strong is not None):
		print ("Price: "+price.strong.text)

Processing a complete website? - wget

The example above concerns a single page. But what if you want to process, say, an entire web shop? Then you have to download and process multiple pages.

This can undoubtedly be done with wget. E.g.:

wget -r -l25 https://example.com

A standard .htaccess login procedure (HTTP basic authentication) is no problem either:

wget --user 'xxx' --password 'yyy' http://example.strompf.com
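What the recursive wget call does can also be sketched in Python: follow the links on each page, stay on the same host, and stop at a maximum depth. To keep the sketch free of dependencies it uses the standard library's html.parser rather than Beautiful Soup. The names LinkParser and crawl are hypothetical, and the actual download is passed in as a fetch function, so nothing here touches the network by itself.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl of one host; `fetch(url)` must return HTML text.

    Returns {url: html} for every page reached within `max_depth` links.
    """
    host = urlparse(start_url).netloc
    pages = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in pages:
            continue                      # already downloaded via another link
        pages[url] = fetch(url)
        if depth == max_depth:
            continue                      # do not follow links any deeper
        parser = LinkParser()
        parser.feed(pages[url])
        for href in parser.links:
            absolute = urldefrag(urljoin(url, href)).url   # resolve, drop #fragment
            if urlparse(absolute).netloc == host and absolute not in pages:
                queue.append((absolute, depth + 1))
    return pages
```

With requests, the fetch argument could be something like lambda u: requests.get(u, auth=("xxx", "yyy")).text, which would also cover the basic-authentication case.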

Non-HTML objects?

Wget also appears to be able to download other kinds of objects. The result didn't look impressive at first sight, but it should work (in this case presumably zip files).

WordPress authentication

But now a website where authentication does not take place via .htaccess, but via the usual WordPress login procedure. Then you have to do something with a cookie that grants access.

No problem:

#!/bin/bash
#
#################################################
# Setup - This is verified
#################################################

site="https://example.com"
login_address="$site/wp-login.php"
log="xxx"
pwd="yyy"
cookies="/tmp/cookies.txt"
agent="Mozilla/5.0"

#################################################
# Authenticate
#################################################
#
wget \
    --user-agent="$agent" \
    --save-cookies "$cookies" \
    --keep-session-cookies \
    --delete-after \
    --post-data="log=$log&pwd=$pwd&testcookie=1" \
    "$login_address"

#################################################
# Download!
#################################################
#
# OK - 140 - Single page "aaa"
##############################
#
# wget \
#     --user-agent="$agent" \
#     --load-cookies $cookies \
#     "https://example.com/download/aaa/"

# 170-Complete site download
####################################
#
# * As a basis for further efforts
#
wget \
    -r -l10 \
    --user-agent="$agent" \
    --load-cookies "$cookies" \
    "https://example.com/"
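The same cookie-based login could presumably be done from Python with a requests session, which stores the cookies from the login response and sends them along with later requests. This is a sketch that mirrors the form fields of the shell script above. The names login_form and open_session are hypothetical, and it has not been verified against a real WordPress site.

```python
import requests

SITE = "https://example.com"            # placeholder, as in the shell script
LOGIN_URL = SITE + "/wp-login.php"


def login_form(log, pwd):
    """Form fields posted to wp-login.php, mirroring --post-data in the script."""
    return {"log": log, "pwd": pwd, "testcookie": "1"}


def open_session(log, pwd, agent="Mozilla/5.0"):
    """Log in once; the returned session keeps the cookies for later requests."""
    session = requests.Session()
    session.headers["User-Agent"] = agent
    response = session.post(LOGIN_URL, data=login_form(log, pwd), timeout=30)
    # Only catches HTTP-level errors; a rejected login may still answer 200
    response.raise_for_status()
    return session
```

After session = open_session("xxx", "yyy"), a protected page would be fetched with session.get(SITE + "/download/aaa/").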

Downloading zip files?

That's a piece of cake: once I have filtered out the relevant links (e.g. with BeautifulSoup), I can pass them to wget as arguments. That is presumably what wget was originally intended for: downloading files from the command line.
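The filtering step could look like the sketch below: collect every link on a page whose path ends in .zip and turn it into an absolute URL. The function name zip_links is an assumption, and the example URLs are placeholders.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def zip_links(html, base_url):
    """Return the absolute URLs of all links on the page that point to .zip files."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])           # resolve relative links
        if urlparse(url).path.lower().endswith(".zip") and url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    page = '<a href="/files/a.zip">a</a> <a href="/about">about</a>'
    print("\n".join(zip_links(page, "https://example.com/download/")))
```

The resulting list can be written to a text file, one URL per line, and handed to wget with its -i option, together with --load-cookies for a site that requires a login.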

JavaScript & cloaking

I have the impression that I am dealing with cloaking on this page, in particular JavaScript code that is executed client-side to arrive at the complete HTML page.

BS cannot execute JavaScript. Separate solutions exist for that, in particular Selenium: a tool that drives a real browser, which can execute JavaScript.

See also

Solutions

Copy the source code of rendered pages?
Use Selenium?

See also

Sources

https://www.youtube.com/watch?v=XQgXKtPSzUI - Windows & beginners, but actually really quite good
https://zach-adams.com/2015/04/python-scraping-wordpress/
https://www.youtube.com/watch?v=ng2o98k983k - Python Tutorial: Web Scraping with BeautifulSoup and Requests
https://www.youtube.com/watch?v=tb8gHvYlCFs - Computerbrowsing

Downloading an entire site

https://stackoverflow.com/questions/10885708/write-a-python-script-that-goes-through-the-links-on-a-page-recursively

wget & login

@@ Regel 13: / Regel 13: @@
 * '''WooCrack Kopiëeren:''' Hoe kun je ''WooCrack'' kopiëren, inclusief de downloads? Dit geldt voor de situatie dat je over inlogcodes beschikt
 * '''Price bot:''' Hoe kan ik met een script voor bepaalde producten, de prijzen van de concurrent zien? In dit geval zou ik vantevoren de URL's van de websites van concurrenten verzamelen
-* '''Alle koolborstel-info:''' Kan ik voor een klant die in ''koolborstels'' handelt, alle koolborstels te wereld downloaden, inclusief alle gerelateerde data?
+* '''Alle productinformatie:''' Kan ik voor een klant die in bepaalde producten handelt, alle producten te wereld downloaden, inclusief alle gerelateerde data?
 ''' Vraagstukken '''
@@ Regel 27: / Regel 27: @@
 * urllib2
 * urllib3
-* requests - Op zichzelfstaand pakket. ''Niet'' onderdeel van een van de andere pakketten! - Vermoedelijk het beste pakket [https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-and-requests-module]. Zie [[requests (Python)]].
+* requests - Op zichzelfstaand pakket. ''Niet'' onderdeel van een van de andere pakketten! - Vermoedelijk het beste pakket [https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-and-requests-module]. Zie [[Requests (Python)]].
-== GET, POST, PUT ==
+== Voorbeeld NewEgg ==
-De belangrijkste ''HTTP requests'':
+Dit uitgewerkte voorbeeld is gebaseerd op [https://www.youtube.com/watch?v=XQgXKtPSzUI deze tutorial].
-* <code>GET:</code> Request a resource from a server. Argument: An URL
-* <code>POST:</code> Sent data to a server to create/update a resource. Als je ''Terug'' klikt binnen een browser, en deze waarschuwt je, dat je iets opnieuw ''submit'', dan betreft het een POST-request
-* <code>PUT:</code> Broertje van POST, maar dan ''idempotent:'' Als je hetzelfde PUT-commando meerdere keren uitvoert, verandert het resultaat niet. Als je een POST-commando herhaalt, creëer je additionele resources/updates.
-GET en POST zijn de twee meestvoorkomende ''HTTP requests''. PUT is een stuk zeldzamer, net als de [https://www.w3schools.com/tags/ref_httpmethods.asp overige requests] (die hier niet behandeld worden).
-== Requests - Inloggen op een afgeschermde pagina ==
-Dat gaat met <code>requests</code> verbazend simpel. Sterker nog: Het voorbeeld op de [https://pypi.org/project/requests/ home page] demonstreet dit al:
-<pre>
->>> r = requests.get('http://example.strompf.com', auth=('xxx','yyy'))
->>> r.status_code
->>> r.headers
-{'Content-language': 'en', 'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT', 'Vary': 'Accept-Encoding,Cookie', 'X-Powered-By': 'PHP/5.5.9-1ubuntu4.27', 'Date': 'Fri, 02 Aug 2019 12:14:12 GMT', 'Cache-Control': 'private, must-revalidate, max-age=0', 'Server': 'Apache/2.4.7 (Ubuntu)', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=5, max=49', 'Connection': 'Keep-Alive', 'Content-Length': '32237', 'Last-Modified': 'Fri, 02 Aug 2019 10:03:24 GMT'}
->>> r.headers['content-type']
-'text/html; charset=UTF-8'
->>> r.encoding
-'UTF-8'
->>> r.text
-'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html lang="en" dir="ltr">\n<head>\n<title>Main Page - Example</title>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n<meta name="generator" content="MediaWiki 1.16.4" />\n<link rel="alternate" type="application/x-wiki" title="Edit" href="/index.php?title=Main_Page&amp;action=edit" />\n<link rel="edit" title="Edit" href="/index.php?title=Main_Page&amp;action=edit" />\n<link rel="shortcut icon" href="/favicon.ico" />\n<link rel="search" type="application/opensearchdescription+xml" href="/opensearch_desc.php" title="Example (en)" />\n<link rel="alternate" type="a
-</pre>
-''' Ter verificatie '''
-<pre>
-#! /usr/bin/python3
-#
-# Experiments with requests
-###################################################################
-#
-#
-import requests
-print(">>> Login zonder credentials...")
-r = requests.get('http://example.strompf.com')
-print(r.status_code)
-print(">>> Login met correcte credentials...")
-r = requests.get('http://example.strompf.com', auth=('xxx','correcte wachtwoord'))
-print(r.status_code)
-print(">>> Login met incorrecte credentials...")
-r = requests.get('http://example.strompf.com', auth=('xxx','Verkeerde wachtwoord'))
-print(r.status_code)
-</pre>
-Uitvoer:
-<pre>
->>> Login zonder credentials...
->>> Login met correcte credentials...
->>> Login met incorrecte credentials...
-</pre>
-P.s.: Handjevol HTTP Status-codes:
-* 200: OK
-* 301: Moved Permanently
-* 307: Temporary Redirected
-* 400: Bad Request
-* 401: Unauthorized
-* 403: Forbidden - Ik geloof dat je dit krijgt als je Google Search probeert te ''scrapen''
-* 404: Not found
-* 504: Gateway Timeout - Krijg ik regelmatig op https://couchsurfing.com
-== Voorbeeld NewEgg - Scraping ==
-[https://www.youtube.com/watch?v=XQgXKtPSzUI]
 === Doel ===
@@ Regel 116: / Regel 42: @@
 * Prijzen.
-=== Script ===
+=== Oorspronkelijk ===
+Deze code sluit het meest aan op de YouTube-video, inclusief gebruik van ''urllib'' en aliassen.
 <pre>
@@ Regel 125: / Regel 53: @@
 #
 # Source: https://www.youtube.com/watch?v=XQgXKtPSzUI
+#
+# Goals
+#######
+#
+# Create a list with the following info per product:
+#
+# * Brand
+# * Title
+# * Price
+# * SKU (if available)
+# * EAN-code (if available)
+#
+print ("\n\n\n")
+print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
+print (">>> 100-Newegg.py")
+print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
+print ("\n\n")
 ###################################################################
-# Libraries
+# Load libraries
 ###################################################################
 #
@@ Regel 283: / Regel 229: @@
+# Try out some stuff...
-# Try out some some stuff...
 ############################
 #
@@ Regel 294: / Regel 239: @@
+##############################################################
+# OK - Have a closer look at cs[10]
+##############################################################
+#
+# cs = p_soup.findAll("div",{"class":"item-container"})
+# type(cs)	# <class 'bs4.element.ResultSet'>
+# len(cs)		# Aantal elementen = 40
+# * This 10th item is a good example, as it has a price (not all items have prices)
+#
+# c=cs[10]
+# type(c)   # bs4.element.tag
+# Fetch brand
+############################################
+#
+# Like this:
+#
+# <div class="item-info"> »
+# <a
+#    class="item-brand"
+#    href="https://www.newegg.com/global/nl-en/GIGABYTE/BrandStore/ID-1314">
+#    <img
+#       alt="GIGABYTE"
+#       src="//c1.neweggimages.com/Brandimage_70x28//Brand1314.gif"
+#       title="GIGABYTE"/>
+# </a>
+#
+# c_brand = c.find(class_="item-brand").img['alt']
+# print ("c10 - Brand: "+c_brand)
+# # Product name
+# ############################################
+# #
+# c_name = c.find(class_="item-title").text
+# print("c10 - Name: "+c_name)
+# # Fetch price
+# ############################################
+# #
+# c_price = c.find(class_="price-current").strong.text
+# print ("c10 - Price: "+c_price)
+##############################################################
+# Iterate over items in cs
+##############################################################
+#
 # Create a resultset with all "item-container" div classes
 ##########################################################
@@ Regel 301: / Regel 295: @@
 #   That's included here as a dictionary: {"argument":"value"}
 #
+print (">>> Create resultset cs...")
 cs = p_soup.findAll("div",{"class":"item-container"})
-type(cs)	# → <class 'bs4.element.ResultSet'>
+type(cs)	# <class 'bs4.element.ResultSet'>
-len(cs)		# → Aantal elementen = 40
+len(cs)		# Aantal elementen = 40
+i=0
+for c in cs:
+	print("\n\n>>>>>>>>> Volgende element")
+	i=i+1
+	print(i)
+	if (c.find(class_="item-brand") is not None):
+		c_brand = c.find(class_="item-brand").img['alt']
+		print ("Brand: "+c_brand)
+	if (c.find(class_="item-title") is not None):
+		c_name = c.find(class_="item-title").text
+		print("Name: "+c_name)
+	if (c.find(class_="price-current") is not None):
+		c_price = c.find(class_="price-current").strong.text
+		print ("Price: "+c_price)
+</pre>
+=== Refactored ===
+* Met ''requests'' ipv. ''urllib''
+* Zonder aliassen
+* Beetje korter.
+<pre>
+#! /usr/bin/python3
+#
+# Newegg webcrawling-example - Refactored
+###################################################################
+#
+print ("\n\n\n")
+print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
+print (">>> 105-Newegg-refactored.py")
+print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
+print ("\n\n")
+###################################################################
+# Load libraries
+###################################################################
+#
+# Beautiful Soup
+################
+#
+from bs4 import BeautifulSoup
+# Webclient
+################
+#
+import requests
+###################################################################
+# Fetch a webpage
+###################################################################
+#
+# The page we want to scrape
+############################
+#
+url = 'https://www.newegg.com/global/nl-en/p/pl?d=graphics+card'
+# Download the page to text object p
+#########################################
+#
+# * type(requests.get(url)): <class 'requests.models.Response'>
+# * Attribute "text" returns the response body decoded to a string
+#
+p = requests.get(url).text
+###################################################################
+# Process the webpage
+###################################################################
+#
+# Parse this object as an html-object (and not like e.g., an XML-
+# or FTP-object)
+#
+p_soup = BeautifulSoup(p, "html.parser")
+# What class is p_soup?
+############################
+#
+type(p_soup)
+#
+# → <class 'bs4.BeautifulSoup'>
+# Which attributes and methods does p_soup have?
+#################################
+#
+dir(p_soup)
+#
+# Reply:
+#
+# ['ASCII_SPACES', 'DEFAULT_BUILDER_FEATURES', 'NO_PARSER_SPECIFIED_WARNING',
+# 'ROOT_TAG_NAME', '__bool__', '__call__', '__class__', '__contains__', '__copy__',
+# '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__',
+# '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__',
+# '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__',
+# '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
+# '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__',
+# '__subclasshook__', '__unicode__', '__weakref__', '_all_strings',
+# '_check_markup_is_url', '_feed', '_find_all', '_find_one', '_is_xml',
+# '_lastRecursiveChild', '_last_descendant', '_linkage_fixer',
+# '_most_recent_element', '_namespaces', '_popToTag', '_should_pretty_print',
+# 'append', 'attrs', 'builder', 'can_be_empty_element', 'cdata_list_attributes',
+# 'childGenerator', 'children', 'clear', 'contains_replacement_characters',
+# 'contents', 'currentTag', 'current_data', 'declared_html_encoding', 'decode',
+# 'decode_contents', 'decompose', 'descendants', 'encode', 'encode_contents',
+# 'endData', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents',
+# 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext',
+# 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling',
+# 'findNextSiblings', 'findParent', 'findParents', 'findPrevious',
+# 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next',
+# 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings',
+# 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling',
+# 'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText',
+# 'get_attribute_list', 'get_text', 'handle_data', 'handle_endtag', 'handle_starttag',
+# 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before',
+# 'isSelfClosing', 'is_empty_element', 'is_xml', 'known_xml', 'markup', 'name',
+# 'namespace', 'new_string', 'new_tag', 'next', 'nextGenerator', 'nextSibling',
+# 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling',
+# 'next_siblings', 'object_was_parsed', 'original_encoding', 'parent',
+# 'parentGenerator', 'parents', 'parse_only', 'parserClass', 'parser_class',
+# 'popTag', 'prefix', 'preserve_whitespace_tag_stack', 'preserve_whitespace_tags',
+# 'prettify', 'previous', 'previousGenerator', 'previousSibling',
+# 'previousSiblingGenerator', 'previous_element', 'previous_elements',
+# 'previous_sibling', 'previous_siblings', 'pushTag', 'recursiveChildGenerator',
+# 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with',
+# 'replace_with_children', 'reset', 'select', 'select_one', 'setup', 'smooth',
+# 'string', 'strings', 'stripped_strings', 'tagStack', 'text', 'unwrap', 'wrap']
-# print(cs[5]) # → HTML-code
+# Try out some stuff...
+############################
+#
+p_soup.h1			# → <h1 class="page-title-text">"graphics card"</h1>
+p_soup.p   			# First p-tag
+p_soup.meta 		# First meta tag
+p_soup.body			# Simply the body of the page :)
+p_soup.body.span	# First span-tag
 ##############################################################
-# Have a closer look at cs[10]
+# OK - Have a closer look at cs[10]
 ##############################################################
 #
+# cs = p_soup.findAll("div",{"class":"item-container"})
+# type(cs)	# <class 'bs4.element.ResultSet'>
+# len(cs)		# Number of elements = 40
 # * cs[10] (the 11th item) is a good example, as it has a price (not all items have prices)
 #
-c=cs[10]
+# c=cs[10]
+# type(c)   # <class 'bs4.element.Tag'>
+# Fetch brand
+############################################
+#
+# Like this:
+#
+# <div class="item-info"> »
+# <a
+#    class="item-brand"
+#    href="https://www.newegg.com/global/nl-en/GIGABYTE/BrandStore/ID-1314">
+#    <img
+#       alt="GIGABYTE"
+#       src="//c1.neweggimages.com/Brandimage_70x28//Brand1314.gif"
+#       title="GIGABYTE"/>
+# </a>
+#
+# c_brand = c.find(class_="item-brand").img['alt']
+# print ("c10 - Brand: "+c_brand)
+# # Product name
+# ############################################
+# #
+# c_name = c.find(class_="item-title").text
+# print("c10 - Name: "+c_name)
+# # Fetch price
+# ############################################
+# #
+# c_price = c.find(class_="price-current").strong.text
+# print ("c10 - Price: "+c_price)
+##############################################################
+# Iterate over items in cs
+##############################################################
+#
+# Create a resultset with all "item-container" div classes
+##########################################################
+#
+# * This is actually plain HTML code
+# * "div" has one attribute-value pair.
+#   That's included here as a dictionary: {"attribute":"value"}
+#
+print (">>> Create resultset cs...")
+cs = p_soup.findAll("div",{"class":"item-container"})
+type(cs)	# <class 'bs4.element.ResultSet'>
+len(cs)		# Number of elements = 40
+i=0
+for c in cs:
+	print("\n\n>>>>>>>>> Next element")
+	i=i+1
+	print(i)
+	if (c.find(class_="item-brand") is not None):
+		c_brand = c.find(class_="item-brand").img['alt']
+		print ("Brand: "+c_brand)
+	if (c.find(class_="item-title") is not None):
+		c_name = c.find(class_="item-title").text
+		print("Name: "+c_name)
+	if (c.find(class_="price-current") is not None):
+		c_price = c.find(class_="price-current").strong.text
+		print ("Price: "+c_price)
+</pre>
+=== Compact ===
+<pre>
+#! /usr/bin/python3
+#
+# Newegg webcrawling-example - Compact
+###################################################################
+#
+print ("\n\n\n")
+print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
+print (">>> 107-Newegg-compact.py")
+print (">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
+print ("\n\n")
-c2=c.findAll("a",{"class":"item-brand"})	# It's a list
-print(c2)
+###################################################################
+# Load libraries
+###################################################################
 #
-# [<a class="item-brand" href="https://www.newegg.com/global/nl-en/GIGABYTE/BrandStore/ID-1314">
+from bs4 import BeautifulSoup
-# <img alt="GIGABYTE" src="//c1.neweggimages.com/Brandimage_70x28//Brand1314.gif" title="GIGABYTE"/>
+import requests
-# </a>]
-c3=c2[0]
-print(c3)
+###################################################################
+# Fetch webpage
+###################################################################
+#
+url = 'https://www.newegg.com/global/nl-en/p/pl?d=graphics+card'
+p_html = requests.get(url).text
+p_soup = BeautifulSoup(p_html, "html.parser")
+###################################################################
+# Process webpage
+###################################################################
 #
-# <a class="item-brand" href="https://www.newegg.com/global/nl-en/GIGABYTE/BrandStore/ID-1314">
+cs = p_soup.findAll("div",{"class":"item-container"})
-# <img alt="GIGABYTE" src="//c1.neweggimages.com/Brandimage_70x28//Brand1314.gif" title="GIGABYTE"/>
-# </a>
+i=0
+for c in cs:
+	i=i+1
+	print("")
+	print(i)
+	if (c.find(class_="item-brand") is not None):
+		c_brand = c.find(class_="item-brand").img['alt']
+		print ("Brand: "+c_brand)
+	if (c.find(class_="item-title") is not None):
+		c_name = c.find(class_="item-title").text
+		print("Name: "+c_name)
+	if (c.find(class_="price-current") is not None):
+		c_price = c.find(class_="price-current").strong.text
+		print ("Price: "+c_price)
+</pre>
+== Processing a complete website? - wget ==
+The example above covers a single page. But what if you want to process, for example, an entire webshop? Then you have to download and process multiple pages.
+''wget'' can do this with a recursive download (<code>-r</code>), here up to 25 levels deep (<code>-l25</code>). For example:
+ wget -r -l25 https://example.com
+A standard .htaccess login (HTTP Basic authentication) is no problem either:
+ wget --user 'xxx' --password 'yyy' http://example.strompf.com
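The same recursive idea can be sketched in Python. This is a minimal sketch, not a replacement for wget: the names <code>LinkCollector</code> and <code>crawl</code> and the <code>fetch</code> callback are hypothetical, and only the standard library is used so that the logic can be tried without a network connection. In practice <code>fetch</code> could be something like <code>lambda url: requests.get(url).text</code>.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl of pages on the same host as start_url.

    fetch is any callable that maps a URL to HTML text (an assumption of
    this sketch). Returns the visited URLs in the order they were fetched.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        visited.append(url)
        if depth >= max_depth:
            continue  # Do not follow links beyond the depth limit
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop "#fragment" parts
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

The <code>max_depth</code> parameter plays the role of wget's <code>-l</code> option, and the host check keeps the crawl from wandering off to other websites.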
+=== Non-HTML objects? ===
+Wget also appears to be able to download other kinds of objects. The result did not look impressive at first sight, but it should work (in this case presumably zip files).
+=== WordPress authentication ===
+Now consider a website where authentication does not go through .htaccess, but through the usual WordPress login procedure. In that case you need a cookie that grants access.
+No problem:
-print(type(c3))
+<pre>
+#!/bin/bash
 #
-# <class 'bs4.element.Tag'>
+#################################################
+# Setup - This is verified
+#################################################
-# → And now I'm lost. How do I get the "title" attribute???
+site="https://example.com"
+login_address="$site/wp-login.php"
+log="xxx"
+pwd="yyy"
+cookies="/tmp/cookies.txt"
+agent="Mozilla/5.0"
+#################################################
+# Authenticate
+#################################################
+#
+wget \
+    --user-agent="$agent" \
+    --save-cookies $cookies \
+    --keep-session-cookies \
+    --delete-after \
+    --post-data="log=$log&pwd=$pwd&testcookie=1" \
+    "$login_address"
-##############################################################
+#################################################
-# Loop through all containers
+# Download!
-##############################################################
+#################################################
+#
+# OK - 140 - Single page "aaa"
+##############################
+#
+# wget \
+#     --user-agent="$agent" \
+#     --load-cookies $cookies \
+#     "https://example.com/download/aaa/"
+# 170-Complete site download
+####################################
 #
-# I suspect there are more eligent ways of looping through a
+# * As a basis for further efforts
-# hierarchical data set - PHP can already do that, so surely
-# Python can do that even better
 #
-# print("********************** loop")
+wget \
+    -r -l10 \
+    --user-agent="$agent" \
+    --load-cookies $cookies \
+    "https://example.com/"
+</pre>
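For comparison, the same login-then-download flow might look roughly like this in Python. It is a sketch under assumptions: the function names <code>build_login_request</code>, <code>make_cookie_opener</code> and <code>fetch_after_login</code> are hypothetical, the form fields <code>log</code>, <code>pwd</code> and <code>testcookie</code> are taken from the wget call above, and the standard library is used instead of ''requests'' to keep the example self-contained. It has not been verified against a live WordPress site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(site, user, password):
    """Build the POST request for the WordPress login form."""
    form = {"log": user, "pwd": password, "testcookie": "1"}
    data = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        site.rstrip("/") + "/wp-login.php",
        data=data,
        method="POST",
    )


def make_cookie_opener(agent="Mozilla/5.0"):
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Send the same user agent with every request made through this opener
    opener.addheaders = [("User-Agent", agent)]
    return opener, jar


def fetch_after_login(site, user, password, page_url):
    """Log in first, then fetch page_url with the session cookies."""
    opener, _jar = make_cookie_opener()
    opener.open(build_login_request(site, user, password))
    with opener.open(page_url) as response:
        return response.read()
```

A call such as <code>fetch_after_login("https://example.com/", "xxx", "yyy", "https://example.com/download/aaa/")</code> would correspond to the "Authenticate" step followed by the single-page download in the shell script.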
-# i=0
+=== Downloading zip files? ===
-# while i < len(cs):
-# 	Print("Item "+i)
+That's easy: once I have filtered the relevant links (e.g., with BeautifulSoup), I can pass them as arguments to wget. That is presumably what wget was originally meant for: downloading files from the command line.
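The filtering step could be sketched as follows. This is only an illustration: <code>ZipLinkFinder</code>, <code>find_zip_links</code> and <code>wget_command</code> are hypothetical names, and the standard-library HTML parser is used here in place of BeautifulSoup so that the example runs without extra packages. The sketch only builds the wget argument list; actually running it (for example with <code>subprocess.run</code>) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ZipLinkFinder(HTMLParser):
    """Collect absolute URLs of <a> links whose path ends in .zip."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on
        url = urljoin(self.base_url, href)
        if urlparse(url).path.lower().endswith(".zip"):
            self.links.append(url)


def find_zip_links(html, base_url):
    """Return all zip download links found in the given HTML text."""
    finder = ZipLinkFinder(base_url)
    finder.feed(html)
    return finder.links


def wget_command(links, cookies):
    """Build the wget argument list for downloading the given links."""
    return ["wget", "--load-cookies", cookies, *links]
```

The check is done on the URL path rather than on the full URL, so links with a query string such as <code>file.zip?v=2</code> are still recognised.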
-# 	c=cs[i]
+== JavaScript & cloaking ==
-# 	brand_container=c.findAll("a",{"class":"item-brand"})
+I have the impression that I am dealing with ''cloaking'' on [https://www.sutureonline.com/vicryl-plus-suture-3-0-vcp293h-fs-2-45-cm-undyed this page], in particular JavaScript code that is executed client-side to build the complete HTML page.
-# 	print("Type brand_container: ".type(brand_container))
+BS cannot execute JavaScript. Separate solutions exist for that, in particular ''Selenium'': a browser automation tool that drives a real browser, which does execute JavaScript.
-# 	# print("Length brand_container: "+len(brand_container))
-# 	print("Title: "+c.img["title"])	# Complete product title
+See also
-# 	i += 1
-</pre>
-== Beautiful Soup - Hun eigen tutorial ==
+* https://stackoverflow.com/questions/45282009/how-to-web-scraping-from-a-javascript-website
+* https://stackoverflow.com/questions/35050746/web-scraping-how-to-access-content-rendered-in-javascript-via-angular-js
+* https://www.quora.com/Can-beautifulsoup-scrape-javascript-rendered-webpages
-* De eigen documentatie van Beautiful Soup, op https://www.crummy.com/software/BeautifulSoup/bs4/doc/, vind ik geweldig.
+=== Solutions ===
-* Beautiful Soup fietst een complex HTML-document om naar een complexe hiërarchische boom van Python-objecten.
-=== NavigableString ===
+* Copy the source code of rendered pages?
+* Use Selenium?
 == See also ==
@@ Regel 377: / Regel 682: @@
 * [[HTML-filtering in Python]]
 * [[Pip (Python) | Pip]]
+* [[Price Monitoring]]
 * [[Print (Python)]]
+* [[Requests (Python)]]
 == Sources ==
@@ Regel 386: / Regel 693: @@
 * https://www.youtube.com/watch?v=tb8gHvYlCFs - Computerbrowsing
-=== Post (ipv. get) ===
+''' Downloading a whole site '''
-* https://www.pythoneasy.com/learn/python-how-to-send-a-http-post-and-put-request-in-python-9
+* https://stackoverflow.com/questions/10885708/write-a-python-script-that-goes-through-the-links-on-a-page-recursively
-* https://stackoverflow.com/questions/111945/is-there-any-way-to-do-http-put-in-python
-=== Request-library ===
+''' wget & login '''
-* https://2.python-requests.org//en/latest/index.html
+* https://stackoverflow.com/questions/22614331/authenticate-on-wordpress-with-wget
+* https://askubuntu.com/questions/29079/how-do-i-provide-a-username-and-password-to-wget