Beautiful Soup: An Intro

Hello Beautiful Soup

We will be using a Python library called Beautiful Soup, which helps us scrape data from websites.

Install Beautiful Soup and the lxml parser with:
       >> pip install beautifulsoup4
       >> pip install lxml

Let's first grab the source code for our website, say google.in:
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://google.in/').read()
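
Some sites reject requests that lack a browser-like User-Agent, and any network call can fail. Here is a minimal sketch of a more defensive fetch, assuming nothing beyond the standard library. The names build_request, fetch_html and USER_AGENT are hypothetical helpers made up for this example, not part of urllib or Beautiful Soup.

```python
import urllib.request

# Hypothetical User-Agent string, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (compatible; bs4-tutorial)"


def build_request(url):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print("Could not fetch", url, "-", err)
        return None
```

With this in place, sauce = fetch_html('https://google.in/') could stand in for the plain urlopen call, as long as you check for None before parsing.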

Now you may want to view the raw source that we grabbed:
print(sauce)


Next, make a BeautifulSoup object from the source of google.in:
soup = bs.BeautifulSoup(sauce,'lxml')
print(soup)
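# Printing the title. Uncomment the statements below one at a time to explore.
#print(soup.title)              # the whole <title> tag
#print(soup.title.name)         # the name of the tag, i.e. 'title'
#print(soup.title.string)       # the text inside the tag
#print(soup.title.parent.name)  # the name of the enclosing tag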
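
To see what each of the title lookups returns without depending on a live site, here is a small sketch run against an inline page. The HTML string is a made-up stand-in, and the built-in 'html.parser' is used only so the example runs without lxml installed.

```python
import bs4 as bs

# Hypothetical stand-in page so the example runs without network access.
html = "<html><head><title>Sample Page</title></head><body><p>Hi</p></body></html>"
soup = bs.BeautifulSoup(html, "html.parser")

title_tag = soup.title                 # the whole <title> element
title_name = soup.title.name           # the tag's name
title_text = soup.title.string         # the text inside the tag
title_parent = soup.title.parent.name  # name of the enclosing tag

print(title_tag)     # <title>Sample Page</title>
print(title_name)    # title
print(title_text)    # Sample Page
print(title_parent)  # head
```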


# get the first <p> tag in the document
print(soup.p)
# find all paragraph tags <p>
print(soup.find_all('p'))
# iterate through all the matches
for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(paragraph.text)
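"""
The difference between string and text is that string gives a NavigableString object, while text gives a plain Python str holding all the text inside the tag, including text from nested tags. Notice that if the paragraph has more than one child (for example, some text plus a nested tag), .string returns None. If its only child is a single tag, .string returns that tag's string instead.
"""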
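
The .string and .text behaviour is easiest to see on a tiny document. The sketch below uses a made-up HTML snippet and the built-in 'html.parser' (an assumption, chosen so the example runs without lxml).

```python
import bs4 as bs

# Hypothetical snippet: three paragraphs with different child structures.
html = "<p>Plain text</p><p>Some <b>bold</b> text</p><p><b>Only bold</b></p>"
soup = bs.BeautifulSoup(html, "html.parser")
plain, mixed, nested = soup.find_all("p")

print(repr(plain.string))   # one text child, so we get that text
print(repr(mixed.string))   # several children, so there is no single string
print(repr(nested.string))  # one child tag, so .string looks inside it
print(mixed.text)           # .text joins all the text, nested tags included
```

When you just want the visible text of a tag, .text is usually the safer choice, since it never returns None.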
# to find all the links (anchor tags) in the HTML and print their href values
for url in soup.find_all('a'):
    print(url.get('href'))
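
Many href values are relative (such as /about), and some anchor tags have no href at all, in which case get('href') returns None. Here is a sketch of one way to handle both, assuming a made-up HTML snippet and using google.in as the base URL only for illustration.

```python
from urllib.parse import urljoin

import bs4 as bs

base_url = "https://google.in/"
# Hypothetical snippet: a relative link, an absolute link, and an anchor with no href.
html = (
    '<a href="/about">About</a>'
    '<a href="https://example.com/x">X</a>'
    '<a name="top">No href</a>'
)
soup = bs.BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only the anchors that actually carry an href attribute.
for anchor in soup.find_all("a", href=True):
    # urljoin resolves relative paths against base_url and leaves absolute URLs alone.
    links.append(urljoin(base_url, anchor["href"]))

print(links)
```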

More stuff on the way...
