Luke Zhu

Hit 1000 Spanish words.

Recently I completed my VN vocab dataset and moved on to Spanish. I hit 1000 words faster than expected. The Spanish vocab feels easier. I think it’s because of 3 things: (1) cognates with English, (2) sentence comprehension, and (3) word length. In general, it makes sense that not all languages are of the same difficulty.

Will not take JLPT N2 in July, but language learning progress update

Since the JLPT is not offered in the US this July, I will not take it. I may take it in December. To compensate, I will show some progress here. Language learning is a somewhat unique type of learning, especially second-language learning. Most ways of learning require you to use your brain a lot. When learning a second language, it is sometimes best to use your brain less. In other words, it is I/O-heavy, not CPU-heavy. I will demonstrate my progress in Japanese and Vietnamese in this post. I have not spent any time writing in Japanese or Vietnamese, so how will I do so? ...

HSK 6

I passed the HSK6 in 11/2024. This is one year after I passed the HSK5. I hope to take the JLPT N2 next summer and pass. If I do pass, it will be a sign that my learning techniques are strong and worth sharing.

Papers 9/26

Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. This paper argues that the shared-nothing architecture, or the lake + warehouse, is less attractive than a unified “lakehouse” model. It is true that Snowflake and Databricks are quite usable and cost-effective. A few parts of the paper are debatable. The first is the assertion that object storage is cheaper than block storage: ...

Papers 9/24

I skimmed a couple of papers today. Log20: Fully Automated Optimal Placement of Log Printing Statements under Specified Overhead Threshold. This paper is about automated placement of log printing statements (LPSes). It does not seem to be a popular industry practice, but the approach has potential. The authors use information about the paths in the call graph to deduce the “informativeness” of an LPS placement. They devised an automated system (Log20) and tested it on HDFS, HBase, and a few other large Java projects. They were able to output automated logs without much overhead. ...

High-Rises

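In 2008: one subway line, 10 high-rises. In 2023: six subway lines, more than 100 high-rises. How? When building, don't study; when studying, don't build. How should one build and study? If I want to build, then perhaps I should understand the mind of a long-term worker. If I want to study, then perhaps I should understand the mind of an efficient student. A day can only hold one mind, so can a month also only hold one mind?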

JSON

I have tried not to look at code for a few days. However, I can’t escape JSON. This reminds me of how much JSON I have seen in my short career. JSON is both easy and hard to deal with. JSON requires both less and more code to deal with. I wonder if it is time to move JSON from the “application layer” into the “database layer”. However, besides just being faster at processing JSON, the “database layer” needs to be more usable. ...
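To make the “application layer” versus “database layer” split concrete, here is a minimal sketch. It assumes Python's built-in `sqlite3` module linked against an SQLite build that includes the JSON functions (`json_extract`); the sample documents, the table name `docs`, and both helper functions are hypothetical and made up for illustration, not taken from any project mentioned in the post.

```python
import json
import sqlite3

# Hypothetical sample documents, made up for this sketch.
DOCS = [
    {"name": "alice", "words": 1000},
    {"name": "bob", "words": 250},
    {"name": "carol", "words": 4000},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [(json.dumps(doc),) for doc in DOCS],
)


def names_app_layer(min_words):
    """Application layer: fetch every row, parse the JSON in Python, then filter."""
    names = []
    for (body,) in conn.execute("SELECT body FROM docs"):
        doc = json.loads(body)
        if doc["words"] >= min_words:
            names.append(doc["name"])
    return sorted(names)


def names_db_layer(min_words):
    """Database layer: let SQLite parse and filter the JSON (assumes JSON functions are available)."""
    rows = conn.execute(
        "SELECT json_extract(body, '$.name') FROM docs "
        "WHERE json_extract(body, '$.words') >= ?",
        (min_words,),
    )
    return sorted(name for (name,) in rows)


print(names_app_layer(1000))
print(names_db_layer(1000))
```

Both functions return the same names; the difference is where the parsing and filtering code lives, which is the usability question raised above.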

Debugging 9900 Lines of Copied Code

With >9900 lines of code written, I now have a functioning C tokenizer and preprocessor. However, I still have a broken parser and an unimplemented code generator. How do I debug the parser code to make it work? I can debug the code using the debugger and println statements. However, is it worth the time? I don’t believe improving my debugging speed is worth the time. Instead, I plan on taking a break from coding and coming back to the project down the road. ...

Small Programming Language Thought

I believe that there is an argument for introducing “modes” to programming languages. Read mode #1 (review mode): Code in review mode is precise and verbose. It is easy for a reviewer to identify bugs and performance issues. Write mode: Code in write mode is concise and imprecise. It is easy for a user to turn thoughts in their head into runnable code. Read mode #2 (learn mode): Code in learn mode is heavily duplicated. Functions are inlined and comments are duplicated. The duplication naturally allows the user to familiarize themselves with the new abstractions in the codebase. Reads and writes are handled differently by databases and storage systems, for good performance reasons. So, why shouldn’t reading code and writing code be handled differently too? Perhaps the efficiency gains from having separate modes are not worth it. ...
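As a rough illustration of the write mode and review mode described above, here is the same small function written twice in Python. This is only my guess at what the two modes might look like; the function names `mean_write` and `mean_review` are hypothetical, and no existing language implements such modes as far as this post claims.

```python
from typing import Sequence


# Write mode (hypothetical): concise and imprecise.
# No types, and the empty-input case is not considered.
def mean_write(xs):
    return sum(xs) / len(xs)


# Review mode (hypothetical): precise and verbose.
# Types, edge cases, and each step are spelled out for a reviewer.
def mean_review(values: Sequence[float]) -> float:
    """Return the arithmetic mean of values; raise ValueError if values is empty."""
    count: int = len(values)
    if count == 0:
        raise ValueError("mean of an empty sequence is undefined")
    total: float = 0.0
    for value in values:
        total += value
    return total / count


print(mean_write([1, 2, 3]))
print(mean_review([1.0, 2.0, 3.0]))
```

The two versions agree on ordinary input. The review version makes the empty-input behavior explicit, which is the kind of detail a reviewer would otherwise have to reconstruct.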

5000 Lines of Rust

Over the past week, I have been translating the chibicc C compiler to Rust. The C compiler is roughly 6.5k lines of code. Translating code is repetitive and tiring. There is little feeling of reward. So why do I do it? Here are the main reasons: (1) I want to be able to write compiler-like tools (SQL parsers, query compilers, VSCode extensions, etc.). Languages and compilers are powerful tools. To write them, remembering what code in a compiler looks like is useful, especially code written by an experienced C++ compiler developer. (2) I want to be able to write Rust code more quickly. Most databases contain 100,000-1,000,000 lines of code. In order to write a small database within a year, one needs to be comfortable adding >500 new lines of code every day. SQLite: 134734 lines of C; Postgres: 840319 lines of C; Noria (research database): 74230 lines of Rust. So far it has been useful, but there may be better ideas. I guess I will slow down and start thinking about some other projects ...
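The lines-per-day estimate can be checked with a quick calculation. This is a minimal sketch using only the line counts quoted above; the helper names `lines_per_day` and `days_to_write` are made up for illustration, and the 365-day year and 500 lines/day rate are the only assumptions.

```python
# Line counts as quoted in the post.
LINE_COUNTS = {
    "SQLite": 134_734,
    "Postgres": 840_319,
    "Noria": 74_230,
}


def lines_per_day(total_lines, days=365):
    """Average new lines per day needed to write total_lines within `days` days."""
    return total_lines / days


def days_to_write(total_lines, daily_rate=500):
    """Days needed to write total_lines at a steady `daily_rate` lines per day."""
    return total_lines / daily_rate


for name, total in LINE_COUNTS.items():
    print(
        f"{name}: {lines_per_day(total):.0f} lines/day for one year, "
        f"or {days_to_write(total):.0f} days at 500 lines/day"
    )
```

At 500 lines per day, a year yields 182,500 lines, which is above the low end of the 100,000-1,000,000 range and more than the SQLite and Noria counts, but well short of Postgres.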