
Utilized machine learning techniques (TF-IDF, PCA, KMeans, t-SNE, LDA, Bokeh) to categorize 2,225 BBC news articles, efficiently clustering content and extracting topics, with promising real-world applications.


AjayDyavathi/Text_Categorization


Text Categorization

This project addresses text categorisation: automatically classifying text documents into predetermined groups based on their content. The task is central to applications such as search engines, content-based recommendation systems, document management systems, sentiment analysis, and automated content labelling in journalism and publishing, all of which need to classify large volumes of text quickly and accurately. Manual categorisation takes considerable time and effort, especially at scale, so the goal here is an automated solution. With it, organisations can manage and analyse large text collections more efficiently, which supports better understanding and decision-making.

The dataset used in this project is the BBC News Articles dataset, which is available on Kaggle. The dataset consists of a CSV file containing 2,225 news articles from the BBC website, published between 2004 and 2005, categorised into five topics: business, entertainment, politics, sport, and tech.

The articles were collected by web scraping. Each article is a single row in the CSV file, with columns for the article's title, text, category, and publication date.

Two problems surfaced during data gathering. Some articles contained irrelevant content, such as adverts or material unrelated to the article's category, and some articles appeared more than once. Data cleaning and deduplication were applied to address both, producing a clean dataset for analysis.
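A deduplication step of this kind might look roughly like the sketch below. It is not the project's actual cleaning code: the column names (`title`, `text`, `category`), the helper `clean_articles`, and the sample rows are all assumptions made for illustration, and it only handles empty rows and exact duplicates, not advert removal.

```python
import csv
import io

def clean_articles(rows):
    """Drop rows with empty text and duplicate texts (hypothetical helper)."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace so trivially different copies compare equal.
        text = " ".join(row["text"].split())
        if not text:
            continue  # skip rows with no article body
        key = text.lower()  # case-insensitive comparison key
        if key in seen:
            continue  # skip repeated articles, keeping the first copy
        seen.add(key)
        cleaned.append({**row, "text": text})
    return cleaned

# Made-up sample standing in for the real CSV file (column names assumed).
sample = io.StringIO(
    "title,text,category\n"
    "A,Shares rise  sharply,business\n"
    "A copy,shares rise sharply,business\n"
    "B,New phone launched,tech\n"
    "C,,sport\n"
)
articles = clean_articles(csv.DictReader(sample))
print([a["title"] for a in articles])
```

Comparing on normalised text rather than on the whole row means two copies with different titles still count as duplicates, which seems the safer default for scraped news data.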

An interesting example of an article in the dataset is "Internet links to your fridge", which is categorised under the 'tech' category and discusses how the internet could be used to control home appliances like refrigerators.

Overall, the dataset is a valuable resource for text classification and clustering tasks, as it provides a diverse set of news articles covering a wide range of topics.
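The repository description names TF-IDF as one of the techniques applied to these articles. To make the idea concrete, here is a minimal pure-Python sketch of the weighting, assuming whitespace tokenisation and the plain `tf * log(N / df)` formula. The function `tfidf` and the sample sentences are illustrative only. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation, so their numbers would differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency is the term's share of the document's tokens.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up sentences, not taken from the dataset.
docs = [
    "the match ended in a draw",
    "the market ended higher",
    "the new phone",
]
weights = tfidf(docs)
```

A word that appears in every document ("the") gets weight zero, while a word unique to one document ("match") outweighs one shared by two ("ended"). That is the property that makes the resulting vectors useful as input for clustering.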
