Unverified Commit e470e2a1 authored by Loïc Dachary
parent 4475f68f
Deadline June 1st
Thematic call: https://nlnet.nl/thema/NGIZeroDiscovery.html
Your name: Loïc Dachary
Email address: ldachary@easter-eggs.com
Phone numbers: +33 6 64 03 29 07
Organisation: Easter-Eggs
Country: France
> Project name: Storing Efficiently Our Software Heritage
> Website / wiki: https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph
> Abstract: Can you explain the whole project and its expected outcome(s).
"Storing Efficiently Our Software Heritage" will be a web service that provides APIs to efficiently store and retrieve the 10 billion small objects of the Software Heritage corpus (letter of recommendation attached). It will be the first implementation of the innovative object storage design published earlier this year (website URL). It can ingest the corpus in bulk, which makes building search indexes an order of magnitude faster and helps with mirroring, among other uses.
It is the first step toward a more ambitious, general purpose undertaking (named BICEPS) to store, search and mirror hundreds of billions of small objects. The storage layer is designed to scale out and be federated at a global scale. A given corpus can grow without being hindered by the bottlenecks associated with centralized storage. Independent actors can run BICEPS and contribute to the global corpus. The BICEPS instances are then federated and, together, effectively aggregate a corpus that a single entity (organization or government) could not produce or sustain by itself. Federation and erasure coding are combined for durability, even when facing catastrophic events such as earthquakes.
> Have you been involved with projects or organisations relevant to this project before? And if so, can you tell us a bit about your contributions?
I started as a core developer for the Ceph project in 2013 and authored thousands of commits to the codebase at https://github.com/ceph/ceph/. My initial focus was on introducing Erasure Coding and on the OSD, which is the daemon responsible for data storage. I'm currently involved in the Ceph stable releases team, backporting bug fixes to the stable release branches.
https://ceph.io/ is a software-defined storage platform. It implements object storage on a single distributed computer cluster, and provides 3-in-1 interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level, and freely available.
> Requested Amount: 50,000€
> Explain what the requested budget will be used for? Does the project have other funding sources, both past and present? (If you want, you can in addition attach a budget at the bottom of the form)
The project is currently funded by Easter-Eggs and Software Heritage (https://www.softwareheritage.org/2021/03/11/towards-a-next-generation-object-storage-for-software-heritage/). Access to hardware resources is provided by Grid5000 https://www.grid5000.fr/ and the Ceph Sepia lab https://wiki.sepia.ceph.com/doku.php?id=hardware:smithi
I am working on completing the benchmarks (see https://git.easter-eggs.org/biceps/biceps and https://forge.softwareheritage.org/T3149 for all the details) that demonstrate the architecture can deliver the expected performance. This will be completed during the summer.
The requested amount will be used to implement the web service once the benchmarks are finished. It will pay my salary to:
* create the web service integrated in the Software Heritage codebase https://forge.softwareheritage.org/diffusion/,
* implement extensive testing to ensure it is robust and durable,
* package it so that it can conveniently be deployed and maintained in production,
* write documentation for system administrators,
* work with system administrators on the pre-production version,
* fix bugs that show up when production begins.
> Compare your own project with existing or historical efforts.
A few projects emerged in the past ten years to address the challenges specific to storing billions of small objects. The problem was first articulated in a 2010 article https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf "Finding a needle in Haystack: Facebook’s photo storage". It inspired two Free Software projects that implement the same idea, Ambry https://github.com/linkedin/ambry/wiki from LinkedIn and SeaweedFS https://github.com/chrislusf/seaweedfs/wiki. They do not, however, address the technical challenges listed below.
There are no projects with the ambition of implementing a general purpose federated storage system at a global scale. The Porta & Bella: Portable Encrypted Storage project at https://spritelyproject.org/#porta-bella is very similar in spirit but addresses a different use case and is not designed to scale.
> What are significant technical challenges you expect to solve during the project, if any?
The project started earlier this year with a study to find a theoretical solution for an object storage that would scale out and provide the following two features for a corpus of objects with a median size around 4KB:
* Marginal space amplification (i.e. when storing an object of given size, the additional raw storage required is less than 10%)
* Fast enumeration and bulk download of the entire data set (i.e. at least an order of magnitude faster than looking up each object individually)
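To make the space amplification criterion concrete, here is a minimal sketch of the arithmetic. It is an illustration only, not the published design: the 4096-byte allocation block, the 32-byte per-object index overhead, the sample object sizes and all function names are assumptions chosen for the example, and redundancy added by replication or erasure coding is deliberately left out.

```python
import math


def space_amplification(payload_bytes: int, raw_bytes: int) -> float:
    """Extra raw storage as a fraction of the payload size (0.10 means 10%)."""
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return (raw_bytes - payload_bytes) / payload_bytes


def raw_bytes_one_object_per_block(object_size: int, block_size: int) -> int:
    """Raw bytes used when every object occupies whole allocation blocks.

    Hypothetical layout: the last block of each object is padded.
    """
    return math.ceil(object_size / block_size) * block_size


def raw_bytes_packed(object_sizes: list[int], per_object_overhead: int) -> int:
    """Raw bytes used when objects are stored back to back in one large file.

    Hypothetical layout: each object only costs a small fixed index entry.
    """
    return sum(object_sizes) + per_object_overhead * len(object_sizes)


# Assumed sample of small objects, around the 4KB median mentioned above.
sizes = [1_000, 4_096, 6_000]
payload = sum(sizes)

unpacked = sum(raw_bytes_one_object_per_block(s, 4_096) for s in sizes)
packed = raw_bytes_packed(sizes, 32)

amp_unpacked = space_amplification(payload, unpacked)
amp_packed = space_amplification(payload, packed)

print(f"one object per block: {amp_unpacked:.1%} extra raw storage")
print(f"packed back to back:  {amp_packed:.1%} extra raw storage")
```

Under these assumptions, padding each small object to a block boundary exceeds the 10% target by a wide margin, while packing objects together stays well below it. That is the motivation for grouping small objects into larger units.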
The result of the study is an object storage design based on Ceph that was published at https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph . Benchmarks are currently conducted on the Grid5000 cluster to verify it has the expected properties.
The implementation of the storage web service is the last step to make this solution available to Software Heritage and, later, the general public. The ultimate goal of BICEPS (i.e. creating a strongly consistent federated storage) is an ambitious proposition that will require years of work and the technical challenges it poses are unclear at this time.
> Describe the ecosystem of the project, and how you will engage with relevant actors and promote the outcomes?
Easter-Eggs entered a partnership with Software Heritage earlier this year, and Software Heritage is the primary use case. Software Heritage is committed to deploying the web service and implementing the storage design in production.
The project was discussed from the very beginning and in the open with Ceph developers and presented at the Ceph Developer Summit in April 2021 in cooperation with the University of Pisa. Easter-Eggs is also working with the University of Turin to set up a Ceph cluster to verify the object storage design is actually convenient to mirror over the internet, a necessary step toward a federated storage.
> What should we do in the other case, e.g. when your project is not immediately selected?
X I want NLnet Foundation to erase all information should my project proposal not be granted
Dear Loïc,
it is my pleasure to inform you that your project "Storing Efficiently Our Software Heritage" (2021-06-060) has been selected to enter the second round of the June 2021 call. While the first round is solely based on your proposal, this strict selection round is potentially interactive. As your project is looked into in more depth, the reviewers may need some additional information to properly assess your application, in which case they will contact you.
Note that proposals are reviewed with regard to urgency, relevance and value for money. Unfortunately we will not be able to fund all projects proposed, as much as we would like to. For the next three weeks we will therefore be thoroughly evaluating the remaining proposals for the second round, during which we may ask you to supply additional details. After that we will inform you of the outcome of this second (and final) selection round.
If you meanwhile have any questions, please let us know.
Kind regards,
Subject: Questions Storing Efficiently Our Software Heritage
Dear Loïc,
you applied to the 2021-06 open call from NLnet. We have some questions regarding your project proposal Storing Efficiently Our Software Heritage.
You requested 50000 euro. Can you provide some more detail on how you arrived at this amount? Could you provide a breakdown of the main tasks, and the associated effort? What rates did you use?
How is project metadata (e.g. a project description, tags) linked to the object data which is stored? With shards being the storage and exchange mechanism among mirrors, is any search oriented data indexed and stored within Shards alongside the object data, or will that kind of data be stored elsewhere? If so, can you use existing tooling for indexing objects in preparation for search (we understand there is some usage of Elastic Search, which has recently relicensed to a non-FOSS license) or will this need to be created separately?
What exactly would the web service that would be built do? Are there other use cases or possibilities for reuse (e.g. CAS systems)? With you being a former developer of both a search engine and of a well-known forge software: currently we are not aware of a good libre solution for universal (cross-forge and version system agnostic) searching for software and software components. This puts projects that do not use Github as the dominant actor at a disadvantage. Do you believe that this project would bring us closer to that? If not, what would be needed to add that capability?
We have already come across cases where Software Heritage held the only copy of a specific resource. We are supporting an effort to use ActivityPub for federating forges, called ForgeFed (https://nlnet.nl/project/ForgeFed). Do you see a possibility for self-hosted forges (e.g. pijul.example.com) to interact with the official SWH web service to report updates, in order to get better coverage and prevent a lot of futile scanning?
How do you see the interaction between this effort, and the effort to push a DHT based distribution mechanism (which is another project we support from SWH)?
Looking forward to your answers, and thank you very much for your timely reply.
Kind regards,
on behalf of NLnet foundation,
(the reply is not publicly available because some participants were not comfortable with publishing it)
Subject: Good news about your proposal to NLnet
Date: Fri, 6 Aug 2021 14:42:33 +0200
To: Loïc Dachary <ldachary@easter-eggs.com>
Dear Loïc,
you applied to the NGI0 Discovery open call from NLnet, round June
2021. Currently a selection of the projects is pending the final stage
review by an independent review committee to validate their
eligibility, and we are happy to inform you that this includes your
project "Storing Efficiently Our Software Heritage" (2021-06-060).
Should your project pass that final hurdle (which under normal
circumstances it should, but please do not seek external publicity
until it is officially confirmed), the selection will be made public
and we will contact you to negotiate a Memorandum of Understanding.
The final amount of the grant will be determined at that point.
We will then also need to share some information about the project
both with the general audience and with the European Commission. In
the interest of time, we ask you to prepare a **one paragraph
management summary** of the project. For examples we refer you to
https://NLnet.nl/thema/NGIZeroDiscovery.html and
We kindly request you to send us this summary as soon as possible.
If you meanwhile have any questions, please let us know.
"Storing Efficiently Our Software Heritage" is a web service that provides APIs to efficiently store and retrieve the 10 billion small objects of the [Software Heritage](https://www.softwareheritage.org/) corpus. It will be the first implementation of the [innovative object storage design](https://wiki.softwareheritage.org/wiki/A_practical_approach_to_efficiently_store_100_billions_small_objects_in_Ceph) that was published in early 2021. It can ingest the corpus in bulk, which makes building search indexes an order of magnitude faster and helps with mirroring, among other uses. It is the first step toward a more ambitious, general purpose undertaking to store, search and mirror hundreds of billions of small objects.
Subject: Good tidings from NLnet
Dear Loïc,
Congratulations! We have received the green light from the independent review committee. That means your project "Storing Efficiently Our Software Heritage" (2021-06-060) is one of the selected proposals eligible to receive a grant from NLnet foundation in the June 2021 NGI0 Discovery call!
We should set up a call in order to undertake the necessary further steps - leading up to a Memorandum of Understanding that includes a concrete project plan with pertinent milestones. Note that the final amount of the grant will be determined in dialogue with you, also taking into account any new insights during the negotiations.
We at NLnet Foundation are very much looking forward to working with you, together with the rest of the NGI Zero coalition - which we will tell you all about during our upcoming call. You will also find the key information in the attached document.
Can you please indicate some convenient dates in the coming weeks?
If you meanwhile have any questions, please let us know.
Hello Loïc,
We can meet tomorrow at 14:00 CEST in
which is a FOSS videoconferencing tool.
Best regards,
In order to have an efficient meeting I prepared a concrete task
list at https://forge.softwareheritage.org/T3432 and a draft project
If at all possible I'd very much like for these discussions to happen
publicly for transparency. However, if NLnet is not comfortable with
the idea I would completely understand.
Timeline: From October 2021 until January 2022 included.
Budget: 350€ per day
Goal: An implementation of the object storage described in
integrated in the Software Heritage codebase, with benchmark results that demonstrate the expected performance
is met.
* October 2021: learning and perfect hash table
T3522 Add winery backend: learning the codebase
T3526 Add winery backend: learning the CI
T3104 T3519 T3520 T3521 Persistent readonly perfect hash table
* November 2021: infrastructure as code for the CI and integration tests
T3523 Add winery backend: create the PostgreSQL cluster
T3524 Add winery backend: create the Ceph cluster
T3525 grid5000 tools and documentation
T3527 Self-host Software Heritage on grid5000
* December 2021: implementation and benchmarks
T3533 Winery backend: implementation
T3528 Add winery backend: grid5000 benchmark
* January 2022: optimization and publication
T3532 T3531 T3530 IO throttling
T3529 Publish object storage benchmark results
I reviewed the information package you provided and have the following comments:
* Software Bill of Materials: should it be created right away or is it a deliverable in itself?
* Payment:
* Mandatory services: Accessibility and Security. Since the outcome
of the project has no direct interactions with users or online
external services, I believe they are not in scope.
* Other services: I do not think any of them would be beneficial for this