Text Categorization

This project addresses text categorisation: automatically assigning text documents to predefined categories based on their content. The task matters in applications such as search engines, content-based recommendation systems, document management systems, sentiment analysis, and automated content labelling in journalism and publishing, where large volumes of text must be classified quickly and accurately. Manual categorisation takes considerable time and effort, especially for large collections, so the goal of this project is an automatic text categorisation solution. Such a solution helps businesses manage and analyse large amounts of text efficiently, which in turn supports better understanding and decision-making.

The dataset used in this project is the BBC News Articles dataset, which is available on Kaggle. The dataset consists of a CSV file containing 2,225 news articles from the BBC website, published between 2004 and 2005, categorised into five topics: business, entertainment, politics, sport, and tech.

The articles were collected using web scraping techniques. Each article is stored as a single row of the CSV file, with columns for the article's title, text, category, and publication date.
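As a rough illustration of the row-per-article layout, the sketch below reads a CSV of that shape and counts articles per category. The column names (`title`, `text`, `category`, `date`), the sample rows, and the helper names `load_articles` and `count_by_category` are assumptions made for this example; the real file's header may differ.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for the real CSV file.
# The column names are assumed, not taken from the actual dataset.
SAMPLE_CSV = """\
title,text,category,date
Sample tech story,A short text about gadgets.,tech,2004
Sample sport story,A short text about a match.,sport,2005
Another tech story,A short text about software.,tech,2005
"""


def load_articles(handle):
    """Read every CSV row from an open file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def count_by_category(articles, column="category"):
    """Count how many articles fall under each category label."""
    return Counter(row[column] for row in articles)


articles = load_articles(io.StringIO(SAMPLE_CSV))
counts = count_by_category(articles)
print(counts)
```

For the real dataset, the `io.StringIO` object would be replaced by a handle from `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded CSV file.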

Two problems were found during data gathering. Some articles contained irrelevant content, such as adverts or material unrelated to the article's category, and some articles appeared in the dataset more than once. Data cleaning and deduplication were applied to address these problems, producing a clean dataset for analysis.
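The deduplication step could look something like the following minimal sketch. It is not necessarily the procedure used in the notebook: it assumes duplicates can be detected by comparing article text after lower-casing and collapsing whitespace, and the `text` column name and the helper names `normalise` and `deduplicate` are hypothetical.

```python
import re


def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(articles, column="text"):
    """Keep the first occurrence of each article, judged by its normalised text."""
    seen = set()
    unique = []
    for row in articles:
        key = normalise(row[column])
        if key in seen:
            continue  # an earlier row already carries this text
        seen.add(key)
        unique.append(row)
    return unique


# Made-up rows: the first two differ only in case and spacing.
sample = [
    {"text": "Shares rise  on strong earnings.", "category": "business"},
    {"text": "shares rise on strong earnings.", "category": "business"},
    {"text": "Team wins the cup final.", "category": "sport"},
]
cleaned = deduplicate(sample)
print(len(sample), "->", len(cleaned))  # prints: 3 -> 2
```

Exact-match comparison after normalisation only catches near-verbatim copies. Articles that were lightly edited between copies would need a fuzzier similarity measure.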

An interesting example of an article in the dataset is "Internet links to your fridge", which is filed under the 'tech' category and discusses how the internet could be used to control home appliances like refrigerators.

Overall, the dataset is a valuable resource for text classification and clustering tasks, as it provides a diverse set of news articles covering a wide range of topics.

AjayDyavathi/Text_Categorization
