Most search engines provide information based on commercial interests and profiling. Andreas Wagner describes a European Union project called Open Web Search which aims to offer free and transparent information access.
What is Open Web Search?
The Open Web Search project was started by a group of people who were concerned that navigation in the digital world is led by a handful of big commercial players (the European search market is largely dominated by Google, for example), who don’t simply offer their services out of generosity but because they want to generate revenue from advertisements. To achieve that they put great effort into profiling users: they analyse what you are searching for and then use this information to create more targeted adverts that create more revenue for them. They also filter search results to present information that fits your world view, to make sure that you come back because you feel at home on those web pages. For some people, and for the European Commission in the context of striving for open access to information and digital sovereignty, as well as becoming independent of US-based tech giants, this is a big concern.
How did the project come about?
In 2017 the founder of the Open Search Foundation reached out to me because I was working on CERN’s institutional search. He had a visionary idea: an open web index that is free, accessible to everyone and completely transparent in terms of the algorithms that it uses. Another angle was to create a valuable resource for building future services, especially data services. Building an index of the web is a massive endeavour, especially when you consider that the estimated total number of web pages worldwide is around 50 billion.
You could argue that unbiased, transparent access to information in the digital world should be on the level of a basic right
A group of technical experts from different institutes and universities, along with the CERN IT department, began with a number of experiments to get a feel for the scale of the project: for example, seeing how many web pages a single server can index, and evaluating the open-source projects used for crawling and indexing web pages. The results of these experiments proved highly valuable when it came to replying to the Horizon Europe funding call later on.
In parallel, we started a conference series, the Open Search Symposia (OSSYM). Two years ago there was a call for funding dedicated to open web search in the framework of the European Union (EU) Horizon Europe programme. Together with 13 other institutions and organisations, the CERN IT department participated, and we were awarded a grant. We were then able to start the project in September 2022.
What are the technical challenges in building a new search engine?
We don’t want to copy what others are doing. For one, we don’t have the resources to build a new, massive data centre. The idea is a more collaborative approach: a distributed system that people can join depending on their means and interests. CERN is leading work-package five, “federated data infrastructure”, in which we and our four infrastructure partners (DLR and LRZ in Germany, CSC in Finland and IT4I in the Czech Republic) provide the infrastructure to set up the system that will ultimately allow the index itself to be built in a purely distributed way. At CERN we run the so-called URL frontier – a system that oversees what is going on in terms of crawling and preparing this index, and maintains a long list of URLs that should be collected. As the crawlers run, they report back on what they have found on different web pages. It’s basically bookkeeping to ensure that we coordinate activities and don’t duplicate the efforts already made by others.
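The bookkeeping role of a URL frontier can be illustrated with a minimal sketch: it hands out URLs to crawlers and records the links they report back, scheduling only pages it has not seen before. The class and method names below are purely illustrative assumptions, not the project’s actual implementation (which is a distributed service, not an in-memory queue).

```python
from collections import deque

class URLFrontier:
    """Toy sketch of a URL frontier: hands out URLs to crawlers and
    records what they discover, so no page is fetched twice."""

    def __init__(self, seeds):
        self.seen = set(seeds)     # every URL ever scheduled
        self.queue = deque(seeds)  # URLs waiting to be crawled

    def next_url(self):
        """Give a crawler its next assignment (None when nothing is pending)."""
        return self.queue.popleft() if self.queue else None

    def report(self, url, discovered_links):
        """A crawler reports the links it found on `url`; only
        previously unseen URLs are added to the queue."""
        for link in discovered_links:
            if link not in self.seen:
                self.seen.add(link)
                self.queue.append(link)

# Example: a crawler fetches the seed page and reports two links,
# one of which is the seed itself and is therefore not re-scheduled.
frontier = URLFrontier(["https://example.org/"])
url = frontier.next_url()
frontier.report(url, ["https://example.org/a", "https://example.org/"])
print(frontier.next_url())  # only the newly discovered page is scheduled
```

In the real system this coordination happens across institutions, but the core invariant is the same: a shared record of what has been scheduled prevents duplicated crawling effort.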
Open Web Search is said to be based on European values and jurisdiction. Who and what defines these?
That’s an interesting question. Within the project there is a dedicated work package six titled “open web search ecosystem and sustainability” that covers the ethical, legal and societal aspects of open search and addresses the need for building an ecosystem around open search, including the proper governance processes for the infrastructure.
The legal aspect is quite challenging because it is all new territory. The digital world evolves much faster than a legislator can keep up! Information on the web is freely available to anyone, but the moment you start downloading and redistributing it you take on ownership and responsibility. So you need to take copyright into account, which is regulated by most EU countries. Criminal law is more delicate when it comes to the legality of content: every country has its own rules and there is no uniformity. Overall, European values include transparency, fairness in data availability and adherence to core democratic principles. We aim to build these European values into the core design of our solution from the very beginning.
What is the status of the project right now?
The project was launched just over a year ago. On the infrastructure side the aim was to have the components in place, meaning having workflows ready and running. It’s not fully automated yet and there is still a lot of challenging work to do, but we have a fully functional set-up, so some institutes have been able to start crawling; they feed in the data and it gets stored and distributed to the participating infrastructure partners, including CERN. At the CERN data centre we coordinate the crawling efforts and provide advanced monitoring. Going forward, we will work on scalability so that there won’t be any problems as the system grows.
What would a long-term funding model look like for this project?
You could argue that unbiased, transparent access to information in the digital world that has become so omnipresent in our daily lives should be on the level of a basic right. With that in mind, one could imagine a governmental funding scheme. Additionally, this index would be open to companies that can use it to build commercial applications on top of it, and for this use-case a back-charging model might be suitable. So, I could imagine a combination of public and usage-based funding.
In October last year the Open Search Symposium was hosted by the CERN IT department. What was the main focus there?
This is purposely not focused on one single aspect but is an interdisciplinary meeting. Participants include researchers, data centres, libraries, policy makers, legal and ethical experts, and society. This year we had some brilliant keynote speakers such as Věra Jourová, the vice president of the European Commission for Values and Transparency, and Christoph Schumann from LAION, a non-profit organisation that looks to democratise artificial intelligence models.
Ricardo Baeza-Yates (Institute for Experiential Artificial Intelligence, Northeastern University) gave a keynote speech about “Bias in Search and Recommender Systems”, and Angella Ndaka (The Centre for Africa Epistemic Justice and University of Otago) spoke on “Inclusion by whose terms? When being in doesn’t mean digital and web search inclusion”, addressing the challenges of providing equal access to information to all parts of the world. We also had some of the founders of alternative search engines joining us, and it was very interesting and inspiring to see what they are working on. And we had representatives from different universities looking at how research is advancing in different areas.
I see the purpose of Open Web Search as being an invaluable investment in the future
In general, OSSYM 2023 was about a wide range of topics related to internet search and information access in the digital world. We will shortly publish the proceedings of the nearly 25 scientific papers that were submitted and presented.
How realistic is it for this type of search engine to compete with the big players?
I don’t see it as our aim or purpose to compete with the big players. They have unlimited resources so they will continue what they are doing now. I see the purpose of Open Web Search as being an invaluable investment in the future. The Open Web Index could pave the way for upcoming competitors, creating new ideas and questioning the monopoly or gatekeeper roles of the big players. This could make accessing digital information more competitive and a fairer marketplace. I like the analogy of cartography: in the physical world, having access to (unbiased) maps is a common good. If you compare maps from different suppliers you still get basically the same information, which you can rely on. At present, in the digital world there is no unbiased, independent cartography available. For instance, if you look up the way to travel from Geneva to Paris online, you might have the most straightforward option suggested to you, but you might also be pointed towards diversions via restaurants, where you then might consider stopping for a drink or some food, all to support a commercial interest. An unbiased map of the digital world should give you the opportunity to decide for yourself where and how you wish to get to your destination.
The project will also help CERN to improve its own search capabilities and will provide an open-science search across CERN’s multiple information repositories. For me, it’s nice to think that we are helping to develop this tool at the place where the web was born. Just as CERN gave the web to the world, we want to make sure that open access to it remains a public right, and to steer it in the right direction.