03 ClickHouse CTO Alexey: Not limited to becoming the fastest database.
By Wang Qilong
As one of the fastest columnar OLAP databases in the world, ClickHouse can process tens of billions of rows of data in milliseconds, and the company introduces itself on its official website with a single word: fast. Yet Alexey Milovidov, the soul of ClickHouse, is someone who takes his time in his work. Facing growing demand in the database field, Alexey did not simply adopt an external product that met the need, but built a general-purpose database system from scratch, which eventually made ClickHouse as fast as it is today. New Programmer magazine invited Zou Xin, vice president of CSDN, to interview Alexey Milovidov and share the secret of his slowing down.
Alexey Milovidov, chief technology officer (CTO) of ClickHouse, worked at Yandex, a Russian search engine company, for 13 years and made great contributions to the development and promotion of ClickHouse. He is one of the core figures of ClickHouse and a technical speaker who has shared his experience and opinions at several technical conferences.
Zou Xin, former vice president of CSDN.
Fifteen years ago, Yandex, also called the "Russian Google", officially launched its web analytics platform Yandex.Metrica, which originated from another Yandex product, Direct, as a tool to help advertisers count ad traffic. Installing Metrica on Yandex's search engine provided insights into the depth of users' visits, conversion rates, and the cost of attracting customers. In 2008, Metrica broke away from Direct to become a standalone service, just like Google Analytics, Facebook Pixel, and Baidu Stats, that analyzes all website visitors and makes it easy to understand what is happening on a site.
By 2009, traditional relational databases dominated the market, Oracle acquired Sun, NoSQL began to emerge, SSDs were entering vendors' offerings, Hadoop continued to gain popularity, and the term "Big Data" finally entered the Internet lexicon. Metrica had grown to become the world's second largest web analytics system, whether ranked by number of websites or by traffic.
Also in that year, the number of Internet users in Russia and around the world increased dramatically, and Metrica began to face the challenge of analyzing its data. Yandex sensed the trend and took the opportunity to launch a database called OLAPServer. However, collecting as much traffic as possible from the Internet was not as easy as Yandex had thought, and countless questions arose: What is the volume of data? What should be done with the data, and how should it be stored? How should it be structured so that users can view flexible, customizable reports? Can MySQL, Postgres, or Oracle handle this data? There was a great deal that OLAPServer could not do. So Yandex entrusted this task to Alexey Milovidov, an engineer on the Metrica team who had been with Yandex for a year at that time.
Alexey was the leader of Metrica's five-person development team. He had joined Yandex in 2008, right after graduating from Russia's oldest university, Moscow State University, and shortly afterward joined the Metrica team as an engineer at the world's second largest web analytics platform. Young and enthusiastic, Alexey was busy day after day providing solutions to every task the company gave him, but he was never in the habit of thinking about his future plans.
However, the team was soon confronted internally with a big question that would affect the fate of the future project: Should we develop our own, or buy an off-the-shelf database product?
In 2009, there weren't as many products to choose from as there are today, and the Metrica team needed a solution that could simply store non-aggregated events with multiple attributes and generate different reports on the fly, without pre-aggregation.
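To make the requirement concrete, here is a minimal C++ sketch of reporting on the fly from non-aggregated events. It is only an illustration, not Metrica's or ClickHouse's actual code: the `Event` struct, its fields, and the `count_by` function are hypothetical names chosen for this example.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical raw (non-aggregated) event: one record per page view.
struct Event {
    std::string site;
    std::string browser;
    std::uint32_t duration_sec;
};

// Build a "views per <dimension>" report directly from the raw events at query time.
// The caller chooses the dimension, so no report needs to be pre-aggregated.
template <typename KeyFn>
std::map<std::string, std::uint64_t> count_by(const std::vector<Event>& events, KeyFn key) {
    std::map<std::string, std::uint64_t> report;
    for (const Event& e : events) {
        ++report[key(e)];  // operator[] value-initializes a missing count to 0
    }
    return report;
}
```

Calling `count_by` once with a function that returns `e.site` and once with a function that returns `e.browser` produces two different reports from the same stored rows, which is the flexibility that pre-aggregated tables cannot offer.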
Alexey weighed the pros and cons and concluded that the solutions available at the time simply couldn't meet the massive requirements of the world's second largest web analytics platform. Products with a long history of development invariably leave a rich intellectual legacy. "This legacy is not only about the source code. It's about organizational structure. It's also about some things even like pricing and finances. Sometimes big companies just cannot improve."
"Sometimes you have to be slightly ignorant and try something just to gain some experience." Alexey believes that it takes hands-on practice, rather than buying off-the-shelf products, to understand the field better; otherwise it only adds pitfalls for future project maintenance. "I don't like to reason or guess at technology. If I don't have an intuitive understanding of how it works, I cannot just read the documentation and think it will work this way."
So he chose another thorny path: Building a general-purpose database system from scratch.
The best way to do more with less.
As a team leader, Alexey often needs to motivate his team, "It's really hard to outperform what has come before, but it's not impossible. By believing in your abilities, staying determined, working hard and taking risks, it is possible to create a product that surpasses existing solutions."
In the early stages of the project, the Metrica team evaluated a variety of database systems to find the best data structure for Metrica. During this process, Alexey became familiar with the concept of columnar databases. Columnar databases were not new in 2009: they had existed since the 1980s and were represented by products such as Sybase IQ and Vertica. Alexey found his direction and wrote a prototype of his own. He named the prototype ClickHouse, which stands for "Clickstream Data Warehouse".
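For readers unfamiliar with the term, the following C++ sketch contrasts a row-oriented record with a column-oriented layout. It illustrates the general idea only and says nothing about how ClickHouse stores data internally; `EventRow`, `EventColumns`, and `total_duration` are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Row-oriented layout: all attributes of one event are stored together.
struct EventRow {
    std::uint64_t user_id;
    std::string url;
    std::uint32_t duration_sec;
};

// Column-oriented layout: each attribute lives in its own contiguous array.
// Index i in every column describes the same event.
struct EventColumns {
    std::vector<std::uint64_t> user_id;
    std::vector<std::string> url;
    std::vector<std::uint32_t> duration_sec;

    void append(const EventRow& row) {
        user_id.push_back(row.user_id);
        url.push_back(row.url);
        duration_sec.push_back(row.duration_sec);
    }
};

// Summing one attribute reads only that column; user_id and url are never touched.
std::uint64_t total_duration(const EventColumns& table) {
    std::uint64_t total = 0;
    for (std::uint32_t d : table.duration_sec) {
        total += d;
    }
    return total;
}
```

An analytical query usually touches a few attributes across many events, so keeping each attribute contiguous means far less data has to be read than when scanning whole rows.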
In his five-person team, any new code, and even any new commit has to be read and understood by the team first. Everyone, including Alexey himself, needs to read the new commit and understand what it means. If someone comes up with a new technical feature or solution that hasn't been proven to be viable or understandable, Alexey doesn't ban it, but suggests trying it out, experimenting with it, and evaluating its usefulness and understandability over time. If the new code proves to be viable and understandable, then the solution is adopted. In the end, the team unanimously embraced Alexey's approach, confirming the general direction of a columnar database, based on the prototype ClickHouse, which was used to validate the feasibility of generating analytics reports in real time from non-aggregated data.
It took Alexey three years to prove this hypothesis. In 2012, ClickHouse launched and immediately began supporting its only client, Metrica, the world's second largest web analytics platform. Taking ClickHouse's early development history as a starting point, Zou Xin began his questions:
Zou: I heard that ClickHouse started with five people. How big is the development team now?
Alexey: At the moment, the development team is only 25 people if you count all our C++ engineers. Of course, we also have other departments like cloud infrastructure, integration, and front-end. And there's also a sales team.
Zou: You once said that "anything implemented in C++ is inherently much faster". Is C++ the critical part of ClickHouse's success?
Alexey: This conclusion actually comes from a process of elimination. For example, there are some databases on the market that are written in Java and Go, but none of them are fast enough. So that leaves three options with decent performance: Rust, C, or C++.
C++ has the advantage over C of being able to organize code and build abstractions more efficiently, and it scales better as projects grow. And since it's inherently difficult to implement zero-cost abstractions in C, structures like hash tables are easier and safer to implement in C++ as well.
MySQL was originally implemented in C, but is now primarily implemented in C++, while PostgreSQL has always been in C. The main reason is that its C implementation is very simple and easy to work with.
Zou: A classic product like Excel has become good enough today that it doesn't need any more features... Is there a similar "end of development" in ClickHouse's product cycle?
Alexey: No, I think it is endless. The more features we implement, the more new ones arise.
Zou: But non-stop development can lead to "big ball of mud" architectures (where too many layers and components are added to the structure of the software, causing the project to become chaotic and complex). How does ClickHouse prevent this and ensure that it remains a well-structured project?
Alexey: We have to motivate people to remove code. Often it's better to subtract than to add, and removing large amounts of code is more valuable than simply adding interesting new features. ClickHouse remains small and compact because we do more with less. In fact, ClickHouse has fewer than a million lines of code, whereas MySQL has millions.
Zou: People always say there's a new tool that will really improve how programs are communicated. But one limiting factor is that even with today's big screens, you can probably only read about 80 lines of code on one screen. So to become well versed in a project with a million lines of code is hard. If I were a new employee at ClickHouse, what advice would you give me so I can learn the code inside out?
Alexey: I would advise newcomers to first try to implement some small features that require changing only a few lines of code, so that they become familiar with how to build the code, how everything is arranged, and where changes have to be made. But sometimes you have to read, and your mood will get lower and lower because you don't understand anything. I don't think you really need to worry too much about not being able to read the code at first, because if you keep reading for a few days, your understanding of the code will gradually improve, and then eventually a miracle will happen and it will dawn on you.
Zou: Your suggestion is to learn by doing, and I think it's great. How many people on the ClickHouse team know all the code?
Alexey: No one. I also don't understand all the code, but many people understand 80% of it, and eventually the whole team's understanding stacks up. So even if there's some code that you don't understand, you can still work as long as you don't touch it.
What's special about ClickHouse is that it's not "special."
After four years of promotion within the company, ClickHouse had become a core Metrica back-end service. On June 15, 2016, Yandex published a post on its official blog titled "Yandex's ClickHouse: a columnar database for the Internet", announcing that ClickHouse was being open-sourced under the Apache 2.0 license.
Alexey and the development team moved the ClickHouse code to a GitHub repository under the Yandex organization and started releasing community builds and accepting external contributions. They also started regular meetups to promote ClickHouse and build a community around it. The first ClickHouse meetup took place in Eastern Europe in 2016, and following the open source wave, ClickHouse has held meetups in Western Europe, the US, and China. At its first offline meetup in China in 2018, ClickHouse drew strong interest from Chinese developers, with 400 attendees on site and 1,000 online viewers.
Zou: Why has ClickHouse evolved from an internal project to a public open source project? What makes it special?
Alexey: Before ClickHouse was open sourced, it was first popular within the company. Yandex was a big company with about 10,000 employees and countless departmental systems, and ClickHouse was initially developed specifically for just one of those departments. But many other departments started to use it. This made me realize that ClickHouse had the potential to be more than just an internal project.
ClickHouse is special precisely because it is not "special". Unlike other data systems that are specialized and designed for specific tasks, ClickHouse found a way to generalize into a database management system. In other words, the strength of ClickHouse lies in its versatility and ability to adapt to different use cases.
Zou: What are the differences between ClickHouse after open source and before?
Alexey: It's not really that different. Maybe we need to take the addition of external customers more seriously, but the whole open source release process, release program, and other practices remain the same.
We will now prioritize issues based on severity and impact. And if there is an issue such as a crash in the production environment, it must be resolved as soon as possible, while minor issues such as some configurations will be dealt with at a lower priority. We will have the same release cycle for all our customers, whether it's a paid or clone project it's released monthly. We also have a continuous integration system that runs 2 to 3 million test cases per day to ensure the quality of new releases.
Zou: What interesting things happened when ClickHouse was going to be open source? For example, how did people discover ClickHouse?
Alexey: Here's an interesting thing from before we went open source. At many local Russian conferences at that time, we said that we had implemented a column-oriented database for our company to do something like click analytics, but that it was proprietary and was going nowhere. So when we announced an open source columnar database to handle clickstream analytics shortly after, it was an immediate sensation.
Zou: So it seems that in a fully open market, a good product will eventually win. I think maybe that's the advantage of an open source environment.
Alexey: Yeah.
Zou: In a sense, is it true that being open source also gives people confidence in using it, because they can look inside the code and find out what's wrong?
Alexey: Absolutely. I think it is one of the advantages. Even if some people will never look at the source code, they just understand that they don't depend on a single corporation. But if they want, they can.
Zou: In the open source environment, does ClickHouse have training sessions or get-togethers for all the ClickHouse users?
Alexey: Unfortunately, no. There were a few events for developers. It has been difficult to organize similar events because of various restrictions in previous years.
Zou: But maybe from now on, it's more likely that these events can be held. Sometimes I feel that face-to-face interaction is important because people can finally connect an email address to a real person. That can really help with the chemistry among team members.
Alexey: I remember we did a hackathon for ClickHouse, where everyone had to implement a feature in a day. And surprisingly, it was successful. However, organizing this event took us a lot of effort, because as problem solvers, we had to think about all the potential solutions for these features in advance.
The only people who can present a project well are the developers themselves
In 2019, just as Yandex's Metrica had split off from Direct, ClickHouse moved from the Yandex GitHub organization to a separate ClickHouse organization. More changes followed.
In 2021, the company named ClickHouse was officially founded in Delaware and headquartered in the San Francisco Bay Area. ClickHouse evolved from an "internal Yandex project" to an "open source Yandex project" and eventually to an "open source ClickHouse company project". In 2022, ClickHouse's European office, also its only physical office, opened in Amsterdam, the Netherlands, and the company launched early plans for cloud services.
Employees of ClickHouse come from a dozen different countries, successfully navigating time zone differences, languages and cultures. This piqued Zou's curiosity: How does Alexey lead his team to do this?
Zou: ClickHouse has employees in more than 10 countries around the world. How do you communicate remotely?
Alexey: We use GitHub and Slack (an enterprise messaging app where collaboration happens mainly in channels). Unfortunately, I find that using Slack can easily lead to distractions at work, and many people get addicted to chatting in channels.
Zou: It's true that in distributed teams, people tend to mix life and work, for example telling jokes and discussing work in the same channel. I'm not sure if that's true at ClickHouse?
Alexey: Yes, I think it's true. Most of our employees work remotely, but we have also set up an office in Amsterdam. Sometimes people just meet in the kitchen, so not everything gets put in Slack, and coworkers in other places sometimes miss part of the communication.
Zou: So how did you solve the time zone problem? For example, if you have collaborators in the US and the time difference is too large, maybe you will never be online at the same time.
Alexey: European coworkers have to adapt to United States time, and people in the United States have to adapt to European time. People in the United States will start working at 6:00 or 7:00 a.m., so we will start working at 4:00 p.m.
The most difficult part is when I start working with people from the United States: we have meetings and write code as late as 5:00 a.m. in Europe, and then people from China (around 12:00 p.m. Beijing time) start asking me questions.
Zou: When you visit China for interviews, as you are doing now, will someone from the company be in charge of code review and decision making?
Alexey: We don't have a specific person in charge of this task. Code review tasks are assigned to all team members, and everyone on the team is required to perform weekly code reviews. This is to ensure the quality of the code and minimize errors, as well as to foster cooperation and mutual understanding between team members.
We have a daily and weekly release schedule to ensure that all team members are completing relevant tasks at the same time, and this approach allows the team to work more efficiently and fluidly.
Zou: How do ClickHouse engineers, the people who are really busy with coding and developing ClickHouse, divide their time between development work and community outreach? Or do you have a dedicated team for this work?
Alexey: We don't have a separate team and I think engineers should present by themselves at community events and show people what they've got and what they're planning.
It's really not easy to do both, so instead of sending someone to a conference in China without any public speaking experience, I'd start by having someone try speaking at smaller events in Europe as a way of gaining experience and confidence.
Zou: How do you engage with different levels of people (core contributors, team members, experts, mid-level enthusiasts and visitors) in an open source project these days?
Alexey: If you talk to someone with a lot of experience, a seasoned professional, you can tell them everything directly. But for inexperienced people, I think it's important to be as helpful and friendly as possible.
Zou: Sometimes I think it's hard because you only know that person by a username.
Alexey: It's as simple as going through the person's pull request in detail. This gives you a better idea of the person's experience level, and also allows you to quickly recognize who the junior contributors are and help them out.
AI's impact on the future: the best of the best
Alexey said that creating an open source project is building a technology without borders, which means software that everyone can use, and a community that can build up everywhere.
Now, in 2023, Alexey has come to China in person. After Alexey told the story of how he grew up with ClickHouse, Zou talked to him about the global AI boom, changes in computer education, and the unique problems of the programming profession in China. What kind of sparks will fly when the values of Chinese and Russian programmers collide?
Zou: Future programmers will ask Copilot and ChatGPT for help, and almost all algorithms and data structures have already been implemented and tested by other people. So why should I still need to learn those basic programming skills?
Alexey: Maybe you should not, but AI will make those who learn the basics more valuable. They will be able to figure out and fix the AI when it goes wrong. In the future, perhaps there will be less demand for low-quality engineers, but the demand for high-quality engineers will continue to increase.
Zou: Do you think people still need to learn all that classical computer theory in college courses?
Alexey: Maybe, but learning theory is not for everyone. So don't try to get everyone to enjoy programming and understand it; maybe some people just don't need it, even in the computer science department. Just give them an introduction, but don't insist on full understanding.
Zou: The developer community often debates what the first programming language for students should be. Some start with C or Basic, and now a lot of people use Python or even Java. What's your opinion?
Alexey: My first programming language was Basic. I think a first language should invite more people into programming in the future, so the way of programming should be easy to learn and visual. Maybe something easy to write, like games, so that people are attracted to learn programming. So I think the first programming language for students should be Python or even JavaScript, which is one of the best languages.
Zou: But this part is very flexible.
Alexey: Yeah.
Zou: So visual/video game programming can be more fulfilling. I think the main problem is that programming software nowadays consists of black and white windows, which can easily make young people think that programming is boring. So in the future it may be necessary to change the programming software itself to make the act of programming more fun, as you mentioned.
Alexey: Exactly, I think people will start with something that interests them. And then real engineers will think about how to hack systems, make little games, do low-level stuff, or explore the underpinnings of applications. That's how you get more people to enjoy programming.
Zou: In China, there are 10 million college graduates every year, and about 10% of them major in computer science related fields, such as computer science, software engineering, and embedded systems. Many people worry that they will only be able to program for 10-15 years and will have to retire around the age of 35, which Chinese developers call the "35-year-old phenomenon". Are there similar concerns for people in Europe?
Alexey: If you are a good programmer with unique experience and a lot of understanding that no one else has, there's no need to worry. Such programmers are just like experienced lawyers or doctors, who always find the right path for themselves.
Zou: So the key is to really develop an area of expertise. But what if one's area of expertise happens to be an older technology like PHP programming?
Alexey: It's also not a big problem. For example, the demand for COBOL programmers remains high, and many COBOL programmers are over 70 years old.
Zou: What advice would you give to college students who are studying programming, computer science, or software engineering?
Alexey: You have to love it and enjoy it.
Zou: But sometimes interests can change. Many people enter the computer industry only to find that they do not like it, and eventually their interest just fades out.
Alexey: Yeah, that's quite possible, especially if you have to do a lot of business logic. In the future, AI is going to make this better, since it can take care of the boring parts.
Zou: So I think that's fine, and then maybe those people can find their own interest that they really enjoy. As you mentioned, the key is that you have to like it.