🐻 AI Search Open Data Berlin

Search semantically, lexically, and multilingually in Berlin's Open Data catalog.

Contents

Usage
What does the code do?
What is semantic search?
Project team
Feedback and contributing

Usage

Create a Conda environment: conda create -n aisearch python=3.9
Activate environment: conda activate aisearch
Clone this repo.
Change into the project directory.
Install packages: pip install -r requirements.txt
Create a .env file in the project directory and add your OpenAI API key:

    OPENAI_API_KEY=sk-...

Run the notebook to create the search index with the open-source vector database Weaviate.
Change into the app directory: cd _streamlit_app/
Start the app: streamlit run aisearch.py
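
The .env step above can be sketched in Python as follows. This is only an illustration of how a key stored as OPENAI_API_KEY=sk-... might be read at startup; the helper names load_env_file and get_openai_key are hypothetical, and the notebooks and app in this repo may load the key differently (for example with a dedicated package).

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openai_key(env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    if Path(env_path).exists():
        return load_env_file(env_path).get("OPENAI_API_KEY")
    return None
```

Keeping the key in .env and out of the code means it is not committed by accident, provided .env is listed in .gitignore.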

What does the code do?

This application allows you to search the Berlin Open Data catalog. It combines exact lexical keyword searches with semantic searches based on meaning and similarity. The search supports multiple languages, including all European languages and many others.

For this prototype app we use OpenAI's embeddings for convenience. We also tested the following open-source models with SentenceTransformers and got very good results:

PM-AI/bi-encoder_msmarco_bert-base_german - 350 tokens context length
Jina AI jina-embeddings-v2-base-de - 8192 tokens context length

Note

The app sends all your search queries to an embedding interface (API) at OpenAI. Please avoid entering sensitive information that you do not want or are not permitted to share with third-party providers like OpenAI.

What is semantic search?

Unlike a lexical search, which looks for exact keywords, a semantic search considers text that is semantically similar but does not have to match the search term exactly. For example, a semantic search for the word disease can find documents containing the words illness, virus, infection, treatment, or healthcare without the word disease appearing in the documents.

Semantic search uses statistical methods and machine learning (ML). By analyzing large amounts of text, ML language models for semantic search have learned word and sentence similarities, which lets them find documents based on these similarities. While semantic search has many advantages, it is approximate rather than exact. Used on its own, it can therefore return false hits or miss relevant entries.
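
To make the idea of searching by similarity more concrete, here is a minimal sketch that ranks documents by cosine similarity between vectors. The three-dimensional vectors are hand-made toy values, not real embeddings, and cosine similarity is assumed here as a common choice; the metric actually used by the app is configured in Weaviate and is not stated in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only; a real model produces
# hundreds or thousands of dimensions per text.
toy_docs = {
    "illness statistics": [0.9, 0.1, 0.0],
    "bicycle lanes": [0.0, 0.2, 0.9],
    "hospital locations": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedding of "disease"

ranking = rank_by_similarity(query_vec, toy_docs)
for doc_id, score in ranking:
    print(f"{score:.3f}  {doc_id}")
```

None of the toy documents contains the word disease, yet the two health-related entries rank above the unrelated one because their vectors point in a similar direction.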

Combining lexical and semantic search into a hybrid search gives you the best of both worlds: the results contain exact lexical matches as well as semantically similar ones.
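
One simple way to picture the combination is a weighted blend of a lexical score and a semantic score. The sketch below is a simplified illustration under stated assumptions, not the fusion method Weaviate applies internally: lexical_score, hybrid_score, the alpha weight, and the semantic scores in the example are all made up for demonstration.

```python
def lexical_score(query, text):
    """Fraction of query terms that appear verbatim in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    if not terms:
        return 0.0
    return sum(1 for term in terms if term in words) / len(terms)


def hybrid_score(lexical, semantic, alpha=0.5):
    """Weighted blend: alpha=0 is purely lexical, alpha=1 purely semantic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1 - alpha) * lexical + alpha * semantic


# Each entry: document text and an assumed semantic similarity to the query.
# The similarity values are invented for this example.
docs = {
    "a": ("disease cases by district", 0.95),
    "b": ("hospital and clinic locations", 0.80),
    "c": ("bicycle lane network", 0.05),
}
query = "disease"

results = sorted(
    ((doc_id, hybrid_score(lexical_score(query, text), semantic))
     for doc_id, (text, semantic) in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Document a ranks first because it matches both lexically and semantically, while document b still appears in the results on semantic similarity alone.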

Project team

This is an independent project aimed at improving access to Berlin's Open Data. The original codebase was adapted from a similar project for the Canton of Zurich.

Feedback and contributing

I would love to hear from you. Please share your feedback on this repo and let me know how you use the code. You can share your ideas by opening an issue or a pull request.
