Product Analysis using Web Scraping Technique in Python

Web scraping is a data scraping technique in which data is extracted from websites for analysis. In this project we will learn how to analyze products in an online shop such as Flipkart. As an example, we will analyze various brands of tablets sold on the Flipkart website and suggest mid-range products by price.

Using web scraping techniques, we can collect the prices, specifications, reviews, highlights, and ratings of any product on the website.

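We will use the Python function urlopen (from the urllib.request module) and the BeautifulSoup class (from the bs4 library), along with pandas, NumPy, seaborn, and matplotlib for the analysis and plots.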

import bs4
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

First we need to read the contents of the web page that displays the search results for the product we are going to analyse (tablets), and parse the HTML content using the BeautifulSoup module.

myurl = "https://www.flipkart.com/tablets/pr?sid=tyy,hry&marketplace=FLIPKART"
uclient = uReq(myurl)
page_html = uclient.read()
uclient.close()

psoup = soup(page_html, "html.parser")
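
Some sites may reject requests that carry the default urllib user agent, and network calls can fail, so it can help to wrap the fetch in a small helper. The sketch below is one way to do that under assumptions: the helper names build_request and fetch_html and the User-Agent string are hypothetical and not part of the original code.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent value; adjust it to identify your own script.
USER_AGENT = "Mozilla/5.0 (compatible; product-analysis-tutorial)"


def build_request(url):
    """Create a request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses
        print(f"Could not fetch {url}: {err}")
        return None
```

The returned bytes can be passed to soup(..., "html.parser") as before, after checking that the result is not None.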

Below is a screenshot of the web page from which we will get the data for analysis. Right-click on the page and choose Inspect; the browser will show the HTML tags for each element on the page.

Right-click on the page index (pagination) bar, select the Inspect option, and find the class name of that element. Use it to get the href of the links to the search result pages for all the products listed, as shown in the picture below.

Now find all the page references and store them in a list, as detailed in the code below.

page_urls = list()
a_list = []
for containers in psoup.findAll('div', {'class': '_2MImiq'}):
    a_list = containers.findAll('a', {'class': 'ge-49M'})

# uncomment and use the below 2 lines to get all 10 search result pages
# for a in a_list:
#     page_urls.append('https://www.flipkart.com' + a['href'])

# use the below lines to get a single search result page
if a_list:
    a = a_list[0]
    page_urls.append('https://www.flipkart.com' + a['href'])
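
The code above builds absolute URLs by string concatenation. As an alternative sketch, urllib.parse.urljoin from the standard library resolves relative hrefs against a base URL and leaves already-absolute links unchanged. The helper name absolute_urls and the sample href values are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.flipkart.com"


def absolute_urls(hrefs, base=BASE_URL):
    """Resolve hrefs against the base URL, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        full = urljoin(base, href)
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result


# Made-up href values, shaped like pagination links
sample_hrefs = [
    "/tablets/pr?sid=tyy,hry&page=1",
    "/tablets/pr?sid=tyy,hry&page=2",
    "/tablets/pr?sid=tyy,hry&page=1",
]
print(absolute_urls(sample_hrefs))
```

Dropping duplicates avoids fetching the same search page twice when the pagination bar repeats a link.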

Once we have the page URLs, we need to visit each product listed on each page, as below.

# lists that collect the scraped fields for the DataFrame
brandnames_list = []
products_list = []
prices_list = []
ratings_list = []

for url in page_urls:
    print(url)
    uclient = uReq(url)
    page_html = uclient.read()
    uclient.close()
    psoup = soup(page_html, "html.parser")

    # collect the links to the individual product pages
    prod_urls = list()
    for containers in psoup.findAll('div', {'class': '_13oc-S'}):
        for a in containers.findAll('a', {'class': '_1fQZEK'}):
            prod_urls.append('https://www.flipkart.com' + a['href'])

    for p_url in prod_urls:
        uclient = uReq(p_url)
        page_html = uclient.read()
        uclient.close()
        psoup = soup(page_html, "html.parser")

        # specification rows of the product (not used further in this analysis)
        all_procuct_items = psoup.find('div', attrs={'class': '_3k-BhJ'})
        if all_procuct_items is not None:
            all_procuct_items = all_procuct_items.findAll('tr', attrs={'class': '_1s_Smc row'})

        # container holds the div tags with class "_1AtVbE col-12-12", which include the product title
        container = psoup.findAll("div", {"class": "_1AtVbE col-12-12"})
        for product_item in container:
            product_dict = dict()
            rating_dict = dict()
            price_dict = dict()
            brandname_dict = dict()
            n = product_item.findAll("span", {"class": "B_NuCI"})         # product name
            p = product_item.findAll("div", {"class": "_30jeq3 _16Jk6d"})  # price
            r = product_item.findAll("div", {"class": "_2d4LTz"})          # rating

            for i in n:
                product_dict['name'] = i.text
                # the first word of the product name is taken as the brand
                strtmp = i.text.split(" ")
                brandname_dict['brandname'] = strtmp[0]
                brandnames_list.append(strtmp[0])
                products_list.append(i.text)
            for j in p:
                # strip the currency symbol and thousands separators before converting
                jStr = j.text
                jStr = jStr.replace("₹", "")
                jStr = jStr.replace(",", "")
                price_dict['price'] = int(jStr)
                prices_list.append(int(jStr))
            for k in r:
                rating_dict['rating'] = k.text
                ratings_list.append(k.text)
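
The loops above append to four separate lists, so a product with a missing rating or price can leave the lists with different lengths, which pd.DataFrame will reject. One possible way around this, sketched here with hypothetical helper names (parse_price, make_row) and made-up sample values, is to build one record per product and store None for any missing field.

```python
import re


def parse_price(text):
    """Convert a price string such as '₹12,999' to an int, or None if no digits are found.

    Assumes whole-rupee prices without a decimal part.
    """
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None


def make_row(name, price_text=None, rating_text=None):
    """Build one record per product; missing fields stay None."""
    name = name.strip()
    return {
        "Brand": name.split(" ")[0],  # first word of the title, as in the loop above
        "Price": parse_price(price_text),
        "Ratings": float(rating_text) if rating_text else None,
        "ProductName": name,
    }


rows = [
    make_row("Samsung Galaxy Tab (sample title)", "₹12,999", "4.3"),  # made-up values
    make_row("Lenovo Tab (sample title)", "₹9,499"),  # no rating found
]
print(rows)
```

A list of such records can be passed directly to pd.DataFrame(rows), and every column then has the same length by construction.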

We can now export the extracted product details to a CSV file using the code below.

df = pd.DataFrame({'Brand': brandnames_list, 'Price': prices_list, 'Ratings': ratings_list, 'ProductName': products_list})
df.to_csv('export_products_dataframe.csv')
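
By default to_csv also writes the DataFrame index as an unnamed first column, and the scraped ratings are strings. The sketch below, using made-up sample rows and a temporary file, shows one way to convert the ratings to numbers with pd.to_numeric and to write the file without the index. It is an illustration only, not part of the original workflow.

```python
import os
import tempfile

import pandas as pd

# Made-up sample rows in the same column layout as the scraped data
df_sample = pd.DataFrame({
    "Brand": ["Samsung", "Lenovo"],
    "Price": [12999, 9499],
    "Ratings": ["4.3", "not rated"],
    "ProductName": ["Samsung sample tablet", "Lenovo sample tablet"],
})

# Strings that are not numbers become NaN instead of raising an error
df_sample["Ratings"] = pd.to_numeric(df_sample["Ratings"], errors="coerce")

csv_path = os.path.join(tempfile.mkdtemp(), "export_products_dataframe.csv")
df_sample.to_csv(csv_path, index=False)  # index=False omits the row index column

# Read the file back to check what was written
df_back = pd.read_csv(csv_path)
print(df_back.columns.tolist())
```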

We can use the box plot, cat plot, and bar plot available in the seaborn module to visualize the price range of the products, as shown in the pictures below, using the code below.

df['Price'] = df['Price'].astype(float)
sns.boxplot(x=df['Price'])

sns.catplot(x="Price",      # x variable name
            y="Brand",      # y variable name
            hue="Ratings",  # group variable name
            data=df,        # dataframe to plot
            kind="bar")

df.groupby('Brand').plot(x='Brand', y='Price')

sns.barplot(x='Brand',
            y='Price',
            data=df)

Figure: Product listing based on ratings and prices
Figure: Price range chart
Figure: Price range chart for the brand Samsung
Figure: Price range chart for the brand Apple
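
The stated goal is to suggest mid-range products, which the plots only show visually. As a rough sketch, one possible definition of "mid-range" is a price between the first and third quartiles. This definition, the helper name mid_range_products, and the sample prices are assumptions, not something the original analysis specifies.

```python
import pandas as pd


def mid_range_products(df, price_col="Price"):
    """Return rows whose price lies between the 25th and 75th percentiles (inclusive)."""
    low = df[price_col].quantile(0.25)
    high = df[price_col].quantile(0.75)
    return df[df[price_col].between(low, high)].sort_values(price_col)


# Made-up prices for illustration
df_sample = pd.DataFrame({
    "Brand": ["A", "B", "C", "D", "E"],
    "Price": [5000, 10000, 15000, 20000, 100000],
})
mid = mid_range_products(df_sample)
print(mid)
```

Using quartiles rather than the mean keeps a single very expensive product from shifting the suggested range.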
