
Vogue Runway - Web Scraping Project

Project Overview

For this project, my objective was to scrape images from Vogue Runway's fashion show collections, with a particular focus on specific seasons and brands, such as Gucci's Resort 2025 runway collection. The goal was to automate the process of collecting, downloading, and resizing these fashion show images, which would later be used for further analysis and trend forecasting. This would enable a comprehensive understanding of emerging fashion trends across multiple collections and brands.

Data Used

The data used in this project are fashion show images from the Vogue Runway website, with a particular focus on Gucci's Resort 2025 runway collection. The images come from HTML elements containing multiple image sets per collection, which needed to be extracted and processed.

Vogue Runway Archive

Vogue Runway Resort 2025 Image Archive

Data Preparation

  • Selenium WebDriver: Selenium WebDriver is a tool for automating web browsers. It allows you to programmatically control a browser, interact with elements like buttons, dropdowns, and forms, and extract data from dynamic web pages.

    • In this project, Selenium is used to navigate the Vogue Runway website, log in, and interact with dynamically loaded content, such as clicking the "Next" button and selecting options from dropdowns. It is effective because it can simulate real user interactions and handle dynamic content that changes or loads based on user inputs, making it essential for scraping pages with JavaScript-driven elements.

  • BeautifulSoup: BeautifulSoup is a Python library used to parse HTML and XML documents. It creates a parse tree that allows for easy extraction of data from HTML structures using methods like find() and select().

    • After Selenium loads the web page, BeautifulSoup is used to efficiently parse the page’s HTML and extract the image URLs. It’s effective because it simplifies navigating through the HTML structure, identifying specific elements (like image tags), and extracting the desired information, making it ideal for web scraping tasks that don’t need user interaction.

  • Requests: Requests is a Python library for sending HTTP requests to web servers. It handles all the complexity of making a web request, allowing you to download content like images, files, or web pages.

    • Once the image URLs are extracted, Requests is used to download the images by sending a simple HTTP GET request. It is effective because it provides a straightforward and reliable way to retrieve web content, and it handles various aspects of downloading such as response status and error handling, ensuring that the images are properly fetched.

  • PIL (Python Imaging Library): PIL, or its more modern version Pillow, is a library for image processing in Python. It allows for opening, manipulating, and saving image files.

    • In this project, PIL/Pillow is used to resize the downloaded images to a consistent size (300x300 pixels). This uniformity makes the images more manageable in terms of storage and easier to process for machine learning or visualization tasks.

  • OS: The OS module in Python provides a way to interact with the operating system, allowing file and directory manipulation.

    • OS is used to create directories where images are saved (os.makedirs) and to handle file paths when downloading images. It’s effective because it allows the script to dynamically create and manage folders, ensuring the images are stored in an organized manner without manual intervention.

  • Time: The time module provides time-related functions, most notably the ability to pause execution with time.sleep().

    • Time is used to introduce delays in the script so the webpage can load fully before actions like scraping or clicking buttons. It’s effective in ensuring the web scraper does not attempt to interact with elements before they are available, preventing errors and improving reliability with dynamic content.
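To ground this toolkit, here is a minimal sketch of the imports the script relies on (assuming Pillow is installed as the modern PIL fork; the standard package names are selenium, beautifulsoup4, requests, and pillow). The code sketches in the Summary of Code section below assume these imports:

    import os       # folder creation and file paths
    import time     # delays between page interactions

    import requests                      # HTTP downloads of image files
    from bs4 import BeautifulSoup        # HTML parsing
    from PIL import Image                # image resizing
    from selenium import webdriver       # browser automation
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC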

Summary of Code

Web Scraping Code:

https://github.com/karishma1480/Vogue-Runway-Webscraping/blob/main/VogueRunway_WebscrapingGit.ipynb

Setting Up the Chrome WebDriver and Logging in to Vogue:

To begin this project, I set up the Chrome WebDriver, configuring it with options such as --no-sandbox and --disable-dev-shm-usage, which help prevent browser crashes in sandboxed or resource-constrained environments. Once the WebDriver was initialized, I automated the login process for the Vogue website using my credentials. To make sure the login worked correctly, I used WebDriverWait to confirm that all page elements, including the login form, were fully loaded before proceeding with any further actions.
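A condensed sketch of this setup step; note that the login URL and form locator below are illustrative placeholders, not the exact ones from the notebook:

    options = webdriver.ChromeOptions()
    options.add_argument("--no-sandbox")             # required in some sandboxed environments
    options.add_argument("--disable-dev-shm-usage")  # avoid crashes from the small /dev/shm partition
    driver = webdriver.Chrome(options=options)
    wait = WebDriverWait(driver, 15)

    driver.get("https://www.vogue.com/account/sign-in")  # hypothetical login page URL
    email_input = wait.until(EC.presence_of_element_located((By.NAME, "email")))  # placeholder locator
    email_input.send_keys("your_email@example.com")
    # ...submit credentials and wait for the post-login page to load...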

Navigating to the Fashion Show Collection:

After logging in, I navigated to the Resort 2025 Gucci collection page. Here, I implemented another WebDriverWait mechanism to pause the script until the main content of the page had loaded. This step was crucial to ensure that all images were available for scraping.
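In code, that navigation step looks roughly like this (the URL follows Vogue Runway's usual season/brand slug pattern, and the container waited on is an assumption):

    driver.get("https://www.vogue.com/fashion-shows/resort-2025/gucci")
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, "main"))  # block until the main content exists
    )
    time.sleep(3)  # extra buffer, since runway pages are media-heavy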

Extracting Image URLs Using BeautifulSoup:

I used the get_image_urls_using_bs4 function, which leverages BeautifulSoup to parse the HTML structure of the page and extract the image URLs. I specifically targeted the relevant HTML elements (li.ImageCollectionListItem-YjTJj.eOyRQu and div.ZoomableWrapper-kJXWeT img) to locate the images and store their URLs. This step formed the core of my image-scraping process.
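A minimal reconstruction of that function, using the selectors quoted above (the notebook version may differ in detail):

    def get_image_urls_using_bs4(driver):
        """Parse the current page source and return the image URLs in the visible set."""
        soup = BeautifulSoup(driver.page_source, "html.parser")
        urls = []
        for item in soup.select("li.ImageCollectionListItem-YjTJj.eOyRQu"):
            img = item.select_one("div.ZoomableWrapper-kJXWeT img")
            if img and img.get("src"):
                urls.append(img["src"])
        return urls

Because class names like these are auto-generated by the site's styling tooling, they can change without notice, which is part of why finding the right selectors took trial and error (see Key Challenges below).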

Handling Multiple Image Sets:

To manage the collection's multiple image sets, I added logic to automatically click the "Next" button on the webpage. I created the click_next_button() function to navigate through the pages, wait for the next set of images to load, and scroll through them to make sure all images were visible and ready for extraction.
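A plausible shape for click_next_button(); the button locator is an assumption, since the exact XPath came from inspecting the page:

    def click_next_button(driver):
        """Click the 'Next' control and give the next image set time to render.
        Returns False when no clickable 'Next' button is found."""
        try:
            next_btn = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Next']"))  # assumed locator
            )
            driver.execute_script("arguments[0].scrollIntoView(true);", next_btn)  # scroll it into view
            next_btn.click()
            time.sleep(2)  # let the next set of images load
            return True
        except Exception:
            return False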

Avoiding Duplicate Images and Repeated Scraping:

I implemented a system to avoid scraping the same images multiple times by tracking previously seen image sets. The script maintained a record of seen URLs (seen_image_urls) and a threshold (repeat_threshold) to detect and halt redundant scraping. This logic allowed the script to terminate gracefully once all new images had been scraped, avoiding infinite loops and unnecessary repeated downloads.
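Putting these pieces together, the scraping loop might look like the sketch below; the function names match those described in this walkthrough, though the notebook's loop structure may differ, and the final two lines anticipate the termination step described later:

    seen_image_urls = set()
    repeat_count = 0
    repeat_threshold = 3  # stop after this many consecutive sets with no new images

    while True:
        current_urls = get_image_urls_using_bs4(driver)
        new_urls = [u for u in current_urls if u not in seen_image_urls]
        if new_urls:
            seen_image_urls.update(new_urls)
            download_images(new_urls)
            repeat_count = 0
        else:
            repeat_count += 1
            if repeat_count >= repeat_threshold:
                break  # the same set keeps repeating: end of the collection
        if not click_next_button(driver):
            break  # no 'Next' button left to click

    driver.quit()
    print(f"Total images downloaded: {len(seen_image_urls)}")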

Downloading and Resizing Images:

Once the image URLs were extracted, I downloaded the images using the download_images() function, saving them in a designated folder (resort_2025_Gucci_images). Additionally, I resized the images to 300x300 pixels using the resize_image() function to optimize them for storage and further analysis.
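A sketch of download_images() and resize_image(); the folder name matches the one described above, while the filename scheme is an assumption. The try/except block reflects the error handling covered in the next subsection:

    def resize_image(path, size=(300, 300)):
        """Resize a saved image in place to a uniform 300x300."""
        img = Image.open(path)
        img = img.resize(size).convert("RGB")  # normalize size and color mode
        img.save(path)

    def download_images(urls, folder="resort_2025_Gucci_images"):
        os.makedirs(folder, exist_ok=True)  # create the target folder if needed
        for url in urls:
            try:
                response = requests.get(url, timeout=30)
                response.raise_for_status()  # turn HTTP errors into exceptions
                path = os.path.join(folder, f"{abs(hash(url))}.jpg")  # assumed naming scheme
                with open(path, "wb") as f:
                    f.write(response.content)
                resize_image(path)
                print(f"Downloaded: {url}")
            except Exception as e:
                print(f"Failed to download {url}: {e}")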

Handling Image Downloads and Errors:

The image download process included error handling to account for potential issues, such as invalid URLs or file-saving errors. After each image was successfully downloaded, the script printed a confirmation message to verify the process.

Termination and Clean-Up:

Finally, after all the unique images were scraped and downloaded, I ensured proper termination by closing the WebDriver, releasing the browser resources it held. The script concluded by printing the total number of images downloaded.

Rationale behind Selected Approach

  • For this project, I combined Selenium and BeautifulSoup because the Vogue Runway website contains dynamic elements driven by JavaScript, such as dropdowns for selecting brands/seasons and the "Next" button for navigating through collections. Selenium was essential for interacting with these dynamic components, while BeautifulSoup was more efficient in parsing the HTML and extracting image URLs.

  • Handling dynamic content and pagination with Selenium was key, as the website required interaction with the "Next" button to load more images. Using WebDriverWait ensured that elements fully loaded before interaction, making the scraper robust against errors. Error handling and time delays (using time.sleep()) were added to manage the dynamic nature of the site, ensuring that the scraper waited for content to load fully after user-like interactions.

  • I avoided using headless mode in Selenium for this project. Running the browser in its full UI form made debugging easier and avoided potential issues with element rendering in headless mode. I also refrained from using parallel requests or threading to prevent triggering the website's anti-scraping mechanisms or encountering rate-limiting.

  • When downloading images, I used the PIL/Pillow library to resize them during the scraping process, ensuring all images had a uniform size for later analysis. Lastly, I implemented set-based URL tracking to avoid duplicate image downloads when the scraper encountered repeated image sets, ensuring only unique images were saved for further analysis.

Key Challenges and Solutions Implemented

  • Issue with Duplicate Extraction: There was a challenge with the scraper extracting and downloading the same set of images multiple times when it reached the end of the collection. This was solved by using a tracking mechanism (seen_image_urls) and a threshold to stop the scraper when the same set of URLs appeared repeatedly.

  • XPath and Element Selection: Identifying the right XPath selectors for dropdowns, buttons, and images was a significant challenge. Through trial and error, correct element selectors were identified, and dynamic element interaction was made possible using Selenium’s wait and click functions.

  • Handling Infinite Loops: One of the initial issues was the code getting stuck on the last page of image sets and repeatedly scraping the same image. To solve this, logic was introduced to detect repeated sets of images and stop the loop when no new images were found after a certain threshold.

  • Identifying Dynamic Page Elements: Selecting the correct XPaths for the "Next" button and the image elements required close inspection of the HTML structure. Challenges included working with dynamically loaded content and dropdown menus that used complex selectors.

  • Delays in Page Loading: The script had to handle page load times carefully, as fashion show pages are often media-heavy. Proper use of time.sleep() and WebDriverWait ensured that the script only proceeded when the page was fully loaded.

  • Dynamic Dropdown for Brand and Season Selection: Originally, there were plans to dynamically select different brands and seasons from dropdowns. Identifying the correct dropdown elements proved challenging, and the selectors required several adjustments.

Conclusion

In this project, I demonstrated how to efficiently scrape images from a complex webpage, handling dynamically loaded content, multiple image sets, and repetitive page structures. The code effectively extracts, downloads, and processes images from the Vogue website and provides a flexible structure for future enhancements, such as automating the scraping of multiple brands or seasons. Key challenges, such as avoiding repeated scraping loops and managing dynamic page elements, were addressed through thoughtful logic, error handling, and the integration of Selenium and BeautifulSoup. This project lays a solid foundation for future image-based data collection and trend analysis.

Next Steps - Trend Forecasting

In Phase 2, the scraped images will be preprocessed for uniformity, and key fashion features such as colors, patterns, and silhouettes will be extracted using machine learning models. The data will be annotated and analyzed to identify patterns and trends across seasons and brands. Finally, Python will be used to forecast future trends based on the extracted data and analysis.
