Beautiful Soup: An Intro
Hello Beautiful Soup
We will be using a Python library called Beautiful Soup, which helps us scrape data from websites.
Install Beautiful Soup and the lxml parser with:
>> pip install beautifulsoup4
>> pip install lxml
Let's first grab the source code of a website, say google.in:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://google.in/').read()
You can view the raw source we grabbed, then parse it with the lxml parser and print the parsed document:
print(sauce)
soup = bs.BeautifulSoup(sauce,'lxml')
print(soup)
#printing the title
#print(soup.title)
#print(soup.title.name)
#print(soup.title.string)
#print(soup.title.parent.name)
#Uncomment the print statements above one at a time to see what each returns.
#get the first <p> tag in the document
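To see what the title-related attributes return without hitting the network, here is a minimal sketch on a made-up HTML string. The markup and the name demo are assumptions for illustration, and it uses Python's built-in html.parser instead of lxml so that nothing extra has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, invented for this example.
html = "<html><head><title>Demo page</title></head><body><p>Hi</p></body></html>"

# html.parser ships with Python; 'lxml' could be used here as well.
demo = BeautifulSoup(html, "html.parser")

print(demo.title)              # the whole <title> tag
print(demo.title.name)         # the tag's name
print(demo.title.string)       # the text inside the tag
print(demo.title.parent.name)  # the name of the enclosing tag
```

Here title.name should be "title", title.string the text "Demo page", and title.parent.name "head", since the title tag sits inside the head tag.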
print(soup.p)
#to find all paragraph tags <p>
print(soup.find_all('p'))
#to iterate through all the matches
for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(paragraph.text)
"""
The difference between .string and .text is that .string returns a NavigableString object, while .text returns an ordinary Python string containing all the text inside the tag, including text from nested tags. Notice that if the paragraph we call .string on contains child tags alongside its own text, we get None back.
"""
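The .string versus .text behaviour described above can be checked on a small example. This is only a sketch: the HTML string and the names demo, first and second are made up, and html.parser is used instead of lxml to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one paragraph with plain text, one with a nested <b> tag.
html = "<p>Plain text only</p><p>Text with a <b>bold</b> child</p>"

demo = BeautifulSoup(html, "html.parser")
first, second = demo.find_all("p")

# A single text child: .string returns it as a NavigableString.
print(repr(first.string))

# Text mixed with a child tag: .string cannot pick one child, so it is None.
print(repr(second.string))

# .text (same as get_text()) joins all the text, nested tags included.
print(second.text)
```

If you need the text regardless of nesting, .text is usually the safer choice; .string is handy when you want to detect tags that contain nothing but a single piece of text.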
#say you want to find all the links (anchor tags) in the HTML
for url in soup.find_all('a'):
    print(url.get('href'))
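One thing to watch for in the loop above: get('href') returns None for anchors that have no href attribute, and many links on real pages are relative. A rough sketch of handling both follows. The HTML, the base_url value and the variable names are assumptions for illustration, and urljoin from the standard library is an addition that the original code does not use.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base address and markup, invented for this example.
base_url = "https://example.com/docs/"
html = (
    '<a href="/about">About</a>'
    '<a href="page2.html">Next</a>'
    '<a name="top">No href here</a>'
)

demo = BeautifulSoup(html, "html.parser")

links = []
for anchor in demo.find_all("a"):
    href = anchor.get("href")
    if href is None:
        # Anchor without an href attribute: nothing to follow.
        continue
    # Turn relative links into absolute ones.
    links.append(urljoin(base_url, href))

print(links)
```

Collecting the links in a list rather than printing them one by one makes it easier to filter or de-duplicate them later.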
More stuff on the way...