Beautiful Soup: An Intro

January 13, 2019

Hello Beautiful Soup

We will be using a Python library called Beautiful Soup, which helps us scrape data from different websites.

Install Beautiful Soup, along with the lxml parser used below, by

>> pip install beautifulsoup4

>> pip install lxml

Let's first grab the source code of a website, say google.in

import bs4 as bs

import urllib.request

sauce = urllib.request.urlopen('https://google.in/').read()

Now you may want to view the code that we grabbed by

print(sauce)
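
Note that urlopen(...).read() returns raw bytes, so print(sauce) shows a b'...' literal. A minimal sketch of a helper that decodes the response into text follows; the name fetch_html, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration, not part of the original code.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page at `url` as a str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, same kind of object as `sauce`
        # Use the charset declared in the Content-Type header, if any;
        # falling back to UTF-8 is an assumption.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example (needs network access):
# print(fetch_html('https://google.in/')[:200])
```

Beautiful Soup accepts either bytes or text, so decoding is mainly useful when you want to read the source yourself.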

Now make a BeautifulSoup object from the google.in source

soup = bs.BeautifulSoup(sauce,'lxml')

print(soup)

#printing the title

#print(soup.title)

#print(soup.title.name)

#print(soup.title.string)

#print(soup.title.parent.name)

#Uncomment the print statements above one by one to explore more.
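
To see what each of those title attributes gives without hitting the network, here is a small sketch on a made-up page. The HTML string and the name demo_soup are assumptions for illustration, and it uses Python's built-in "html.parser" instead of lxml so that nothing extra needs to be installed.

```python
import bs4 as bs

# A tiny made-up page, so the example runs without network access.
html = "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"
demo_soup = bs.BeautifulSoup(html, "html.parser")

print(demo_soup.title)              # <title>Example page</title>
print(demo_soup.title.name)         # title
print(demo_soup.title.string)       # Example page
print(demo_soup.title.parent.name)  # head
```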

#get the first <p> tag

print(soup.p)

#to find all paragraph tags <p>

print(soup.find_all('p'))
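
The difference between soup.p and soup.find_all('p') is easier to see on a small snippet. This is only a sketch: the HTML and the variable names are made up, and the built-in "html.parser" is used in place of lxml.

```python
import bs4 as bs

html = "<div><p>first</p><p>second</p></div>"
demo_soup = bs.BeautifulSoup(html, "html.parser")

first = demo_soup.p                       # only the first <p> tag
all_paragraphs = demo_soup.find_all("p")  # list-like result with every <p>
missing = demo_soup.table                 # no <table> in this snippet, so None

print(first)                # <p>first</p>
print(len(all_paragraphs))  # 2
print(missing)              # None
```

Attribute access like soup.p gives None when the tag is absent, so check for that before using the result.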

#to iterate through all the findings

for paragraph in soup.find_all('p'):

    print(paragraph.string)

    print(str(paragraph.text))

"""

The difference between string and text is that string produces a NavigableString object, while text is just typical unicode text made up of all the text inside the tag, including text from child tags. Notice that if the paragraph item contains more than one child (for example, text mixed with a nested tag), .string returns None.

"""
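
A quick sketch of .string versus .text on three made-up paragraphs (the HTML and the names plain, mixed and nested are assumptions for illustration, parsed with the built-in "html.parser"):

```python
import bs4 as bs

html = "<p>Plain text</p><p>Some <b>bold</b> text</p><p><b>Only bold</b></p>"
demo_soup = bs.BeautifulSoup(html, "html.parser")
plain, mixed, nested = demo_soup.find_all("p")

print(plain.string)   # Plain text
print(mixed.string)   # None (text and a <b> tag are mixed together)
print(mixed.text)     # Some bold text
print(nested.string)  # Only bold (a single child tag is looked through)
```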

#say you want to find all the links or anchor tags from the html,

for url in soup.find_all('a'):

    print(url.get('href'))
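
Real pages often contain anchors without an href, and many links are relative. One possible way to handle both is sketched below; the HTML snippet and base_url are made up, and resolving links with urljoin is a suggestion rather than something the code above does.

```python
import bs4 as bs
from urllib.parse import urljoin

# Made-up snippet and base URL, for illustration only.
html = '<a href="/about">About</a><a href="page.html">Page</a><a name="top">No href</a>'
base_url = "https://example.com/docs/"
demo_soup = bs.BeautifulSoup(html, "html.parser")

links = []
for anchor in demo_soup.find_all("a"):
    href = anchor.get("href")  # None when the tag has no href attribute
    if href is not None:
        links.append(urljoin(base_url, href))  # resolve relative links

print(links)
# ['https://example.com/about', 'https://example.com/docs/page.html']
```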

More stuff on the way...
