Over the past week, I have been translating the chibicc C compiler to Rust. chibicc is roughly 6.5k lines of C.

Translating code is repetitive and tiring. There is little feeling of reward. So why do I do it?

Here are the main reasons why:

  1. I want to be able to write compiler-like tools (SQL parsers, query compilers, VSCode extensions, etc.). Languages and compilers are powerful tools, and when writing them it helps to remember what code in a real compiler looks like, especially code written by an experienced C++ compiler developer.
  2. I want to be able to write Rust code more quickly. Most databases contain 100,000-1,000,000 lines of code. In order to write a small database within a year, one needs to be comfortable adding >500 new lines of code every day.
    • SQLite: 134,734 lines of C
    • Postgres: 840,319 lines of C
    • Noria (research database): 74,230 lines of Rust

So far it has been useful, but there may be better approaches. I will probably slow down and start thinking about some other projects.

Here are some things I’ve learned so far:

  1. Translating 500+ lines of code a day is tiring. Will it get less tiring?
  2. Tracing is useful but generates too much data (1GB/run) when you instrument every function. Can a custom-built column store and dataflow engine make it easier to debug code using traces? I think it is possible.
  3. Compilers become complicated and error-prone because a lot of code shares the same path. If you “partition” the control flow properly, you can reduce the complexity of the main path, but you then have to deal with the problems the partitioning itself introduces. There are many ways to divide a codebase.
  4. Compilers are composed of 4 main parts: tokenizers, parsers, optimizers, and code generators. A C compiler also has a preprocessor for handling macros.
  5. Compilers typically scan across tokens multiple times, applying small rules at each stage. Loading tokens from disk in small batches and storing them in a contiguous buffer should lead to good performance. The chibicc compiler stores things in linked lists for simplicity, but that may lead to more cache misses than needed.
  6. Parsing is a bit more complicated, and I am unsure what data layout would lead to the fewest cache misses. I will need to check the literature to see whether cache misses are even a bottleneck there.
  7. Compilers typically have some recursive steps: recursive tokenization and, of course, recursive-descent parsing.
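On point 2, even before building a proper column store, the JSON-lines trace output can be mined with a naive scan. Here is a minimal sketch, assuming each line carries a `"name":"…"` fragment for its span; the exact field layout of the subscriber output is an assumption, not a parsed schema.

```rust
use std::collections::HashMap;

// Count occurrences of each span name in JSON-lines trace output with a
// naive substring scan. The `"name":"…"` layout is an assumed, simplified
// stand-in for whatever the JSON subscriber actually emits.
fn count_spans(trace: &str) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for line in trace.lines() {
        if let Some(start) = line.find("\"name\":\"") {
            let rest = &line[start + "\"name\":\"".len()..];
            if let Some(end) = rest.find('"') {
                *counts.entry(rest[..end].to_string()).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    let trace = concat!(
        "{\"level\":\"DEBUG\",\"name\":\"len\"}\n",
        "{\"level\":\"DEBUG\",\"name\":\"as_str\"}\n",
        "{\"level\":\"DEBUG\",\"name\":\"len\"}"
    );
    let counts = count_spans(trace);
    println!("len appeared {} times", counts["len"]); // prints 2
}
```

A real tool would parse the JSON properly and store each field as a column, but even a scan like this can answer "which instrumented functions fire most often" over a 1GB trace.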

This week I hit 5000 lines. I expect there are ~5000 more lines to write.
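Point 5's contiguous buffer can be sketched like this: all tokens sit in one `Vec`, and the scanner's position is just an index, so advancing is a sequential walk over adjacent memory rather than a pointer chase. (`Tok` here is a hypothetical, stripped-down token, not the `Token` struct in the listing below.)

```rust
// Hypothetical minimal token type; the real Token carries many more fields.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Punct(char),
    Num(i64),
}

// All tokens in one contiguous Vec; the cursor is just an index into it.
struct Cursor {
    toks: Vec<Tok>,
    pos: usize,
}

impl Cursor {
    // Look at the current token without consuming it.
    fn peek(&self) -> Option<&Tok> {
        self.toks.get(self.pos)
    }

    // Consume and return the current token, advancing the index.
    fn bump(&mut self) -> Option<Tok> {
        let t = self.toks.get(self.pos).cloned();
        self.pos += 1;
        t
    }
}

fn main() {
    let mut c = Cursor {
        toks: vec![Tok::Num(1), Tok::Punct('+'), Tok::Num(2)],
        pos: 0,
    };
    assert_eq!(c.peek(), Some(&Tok::Num(1)));
    let mut sum = 0;
    while let Some(t) = c.bump() {
        if let Tok::Num(n) = t {
            sum += n;
        }
    }
    println!("sum of numeric tokens: {sum}"); // prints 3
}
```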

use core::str;
use std::{
    cell::{Cell, RefCell},
    collections::{HashMap, HashSet, LinkedList},
    ffi::{c_char, CString},
    fs,
    io::{stdin, Read},
    ops::{Index, Range, RangeFull},
    os::unix::fs::MetadataExt,
    path::{Path, PathBuf},
    rc::Rc,
};

use chrono::{DateTime, Local};
use libc::{isxdigit, strtod};
use tracing::{instrument, Level};
use tracing_subscriber::fmt::format::FmtSpan;

//
// Custom logging (using the tracing library)
//
#[instrument(ret)]
pub fn setup_tracing() {
    let subscriber = tracing_subscriber::fmt::Subscriber::builder()
        .with_span_events(FmtSpan::NEW | FmtSpan::CLOSE)
        .with_ansi(false)
        .with_line_number(true)
        .json()
        .with_span_list(false)
        .finish();

    tracing::subscriber::set_global_default(subscriber).unwrap();
}

//
// Custom, rust-only data structures
//

#[derive(Clone)]
struct Location {
    idx: usize,
    end_idx: usize,
    contents: Rc<Vec<u8>>,
}

impl std::fmt::Debug for Location {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Show at most ~30 bytes of the source text, rounding the cut up to
        // a char boundary so the slice cannot panic mid-codepoint.
        let s = self.as_str();
        let mut i = std::cmp::min(30, s.len());
        while !s.is_char_boundary(i) {
            i += 1;
        }
        f.debug_struct("Location")
            .field("idx", &self.idx)
            .field("end_idx", &self.end_idx)
            .field("contents[idx..end_idx]", &&s[..i])
            .finish()
    }
}

impl Location {
    #[instrument(level = "debug")]
    fn len(&self) -> usize {
        return self.end_idx - self.idx;
    }

    #[instrument(level = "debug")]
    fn as_bytes(&self) -> &[u8] {
        return &self.contents[self.idx..self.end_idx];
    }

    #[instrument(level = "debug")]
    fn as_str(&self) -> &str {
        let u8 = &self.contents[self.idx..self.end_idx];
        core::str::from_utf8(u8).unwrap()
    }
}

impl Index<usize> for Location {
    type Output = u8;

    #[instrument(level = "debug")]
    fn index(&self, index: usize) -> &Self::Output {
        return self.contents.index(self.idx + index);
    }
}

impl Iterator for Location {
    type Item = u8;

    #[instrument(level = "debug", ret)]
    fn next(&mut self) -> Option<Self::Item> {
        if self.idx >= self.end_idx {
            return None;
        } else {
            let ret = self.contents[self.idx];
            self.idx += 1;
            return Some(ret);
        }
    }
}

//
// strings.rs
//

//
// tokenize.rs
//

#[derive(PartialEq, Debug, Clone)]
enum TokenKind {
    Ident,   // Identifiers
    Punct,   // Punctuators
    Keyword, // Keywords
    Str,     // String literals
    Num,     // Numeric literals
    PPNum,   // Preprocessing numbers
}

#[derive(Clone)]
struct File {
    name: String,
    file_no: i32,
    contents: Rc<Vec<u8>>,
    display_name: String,
    line_delta: i32,
}

impl std::fmt::Debug for File {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("File")
            .field("name", &self.name)
            .field("file_no", &self.file_no)
            .field("display_name", &self.display_name)
            .field("line_delta", &self.line_delta)
            .finish()
    }
}

#[derive(Clone, Debug)]
enum TokenFields {
    None,
    Int(Type, i64),
    Float(Type, f64),
    Str(Type, Box<[u8]>), // Can be UTF-8, UTF-16, or UTF-32.
}

#[derive(Clone)]
pub struct Token {
    kind: TokenKind,
    loc: Location, // Token location (equivalent to a char*)

    file: Rc<RefCell<File>>,
    filename: Option<String>,
    line_no: i32,
    line_delta: i32,
    at_bol: bool,              // If token is at beginning of line, true
    has_space: bool,           // If token follows space, true
    hideset: Vec<Hideset>,     // For macro expansion
    origin: Option<Rc<Token>>, // If this is expanded from a macro, the original token
    other: TokenFields,
}

impl std::fmt::Debug for Token {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("Token")
            .field("kind", &self.kind)
            .field("loc", &self.loc)
            .field("other", &self.other)
            .field("line_no", &self.line_no)
            .field("line_delta", &self.line_delta)
            .field("at_bol", &self.at_bol)
            .field("has_space", &self.has_space)
            .finish_non_exhaustive()
    }
}

impl std::fmt::Display for Token {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        return write!(f, "Token({:?}, {})", self.kind, self.loc.as_str());
    }
}

//
// preprocess.c
//
#[derive(Debug, Clone)]
struct Hideset {
    name: String,
}

enum ObjFields {
    None,
    LVar {
        // Local variable
        offset: i32,
    },
    GVar {
        // Global variable or function
        is_function: bool,
        is_definition: bool,
        is_static: bool,
        // Global variable
        is_tentative: bool,
        is_tls: bool,
        init_data: Vec<u8>,
        rel: Option<Relocation>,
    },
    Func {
        // Global variable or function
        is_function: bool,
        is_definition: bool,
        is_static: bool,
        // Function
        is_inline: bool,
        params: Vec<Rc<RefCell<Obj>>>,
        body: Rc<RefCell<Node>>,
        locals: Vec<Rc<RefCell<Obj>>>,
        va_area: Rc<RefCell<Obj>>,
        alloca_bottom: Rc<RefCell<Obj>>,
        stack_size: i32,
    },
    StackInlineFunc {
        // Stack inline function
        is_live: bool,
        is_root: bool,
        refs: Vec<String>,
    },
}

//
// parse.c
//
struct Obj {
    next: Rc<RefCell<LinkedList<Rc<RefCell<Obj>>>>>,
    name: String,
    ty: Type,
    tok: Option<Token>, // Representative token
    is_local: bool,     // local or global/function
    align: usize,       // alignment
    other: ObjFields,
}

// Global variable can be initialized either by a constant expression
// or a pointer to another global variable. This struct represents the
// latter.
#[derive(Debug)]
struct Relocation {
    offset: i32,
    label: String, // TODO: pointer?
    addend: i32,
}

#[derive(Debug)]
enum NodeBinaryKind {
    Add,
    Sub,
    Mul,
    Div,
    Neg,
    Mod,
    BitAnd, // &
    BitOr,  // |
    BitXor, // ^
    Shl,    // <<
    Shr,    // >>
    Eq,     // ==
    Ne,     // !=
    Lt,
    Le,
    Assign,
    Comma,
}

#[derive(Debug)]
enum NodeUnaryKind {
    Addr,
    Deref,
    Not,    // !
    BitNot, // ~
    LogAnd, // &&
    LogOr,  // ||
}

#[derive(Debug)]
enum NodeStmtKind {
    ExprStmt, // Expression statement
    StmtExpr, // Statement expression
}

#[derive(Debug)]
enum NodeAtomicKind {
    Cas,
    Exch, // Atomic exchange
}

#[derive(Debug)]
enum NodeCondKind {
    Cond, // ?:
    If,
    For,
    Do,
    Switch,
    Case,
    Goto,
    GotoExpr, // "goto" labels-as-values
    Label,    // labeled statement
    LabelVal, // Labels-as-values
}

enum NodeFields {
    NullExpr, // Do nothing
    Binary {
        kind: NodeBinaryKind,
        lhs: Rc<RefCell<Node>>,
        rhs: Rc<RefCell<Node>>,
    },
    Member {
        member: Member,
    },
    Unary {
        kind: NodeUnaryKind,
        expr: Rc<RefCell<Node>>,
    },
    Return,
    Cond {
        kind: NodeCondKind,
        // if or for statement
        cond: Rc<RefCell<Node>>,
        then: Rc<RefCell<Node>>,
        els: Rc<RefCell<Node>>,
        init: Rc<RefCell<Node>>,
        inc: Rc<RefCell<Node>>,
        // "break" and "continue" labels
        brk_label: String,
        cont_label: String,
        // Switch
        case_next: Rc<RefCell<Node>>,
        default_case: Rc<RefCell<Node>>,
        // Case
        begin: i32,
        end: i32,
    },
    Block, // { .. }
    Label {
        // Goto or labeled statement, or labels-as-values
        label: String,
        unique_label: String,
        goto_next: Rc<RefCell<Node>>,
    },
    FnCall {
        // Function call
        func_ty: Type,
        args: Vec<Rc<RefCell<Node>>>,
        pass_by_stack: bool,
        ret_buffer: Rc<RefCell<Obj>>,
    },
    Stmt {
        kind: NodeStmtKind,
        expr: Option<Rc<RefCell<Node>>>,
        body: Vec<Rc<RefCell<Node>>>,
    },
    Var(Rc<RefCell<Obj>>),
    VlaPtr(Rc<RefCell<Obj>>),
    Int(i64),
    Float(f64),
    Cast(Rc<RefCell<Node>>),
    Cas {
        // Atomic compare-and-swap
        addr: Rc<RefCell<Node>>,
        old: Rc<RefCell<Node>>,
        new: Rc<RefCell<Node>>,
    },
    Exch {
        lhs: Rc<RefCell<Node>>,
        rhs: Rc<RefCell<Node>>,
    },
    MemZero,
    Asm(String),
}

impl std::fmt::Debug for NodeFields {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Child nodes are deliberately omitted (the tree is deeply recursive
        // and blows up trace output); `..` in the patterns and
        // `finish_non_exhaustive` mark the elided fields.
        match self {
            Self::NullExpr => write!(f, "NullExpr"),
            Self::Binary { kind, .. } => f
                .debug_struct("Binary")
                .field("kind", kind)
                .finish_non_exhaustive(),
            Self::Member { member } => f.debug_struct("Member").field("member", member).finish(),
            Self::Unary { kind, .. } => f
                .debug_struct("Unary")
                .field("kind", kind)
                .finish_non_exhaustive(),
            Self::Return => write!(f, "Return"),
            Self::Cond { kind, .. } => f
                .debug_struct("Cond")
                .field("kind", kind)
                .finish_non_exhaustive(),
            Self::Block => write!(f, "Block"),
            Self::Label { .. } => f.debug_struct("Label").finish_non_exhaustive(),
            Self::FnCall {
                func_ty,
                pass_by_stack,
                ..
            } => f
                .debug_struct("FnCall")
                .field("func_ty", func_ty)
                .field("pass_by_stack", pass_by_stack)
                .finish_non_exhaustive(),
            Self::Stmt { kind, .. } => f
                .debug_struct("Stmt")
                .field("kind", kind)
                .finish_non_exhaustive(),
            Self::Var(..) => f.debug_tuple("Var").finish(),
            Self::VlaPtr(..) => f.debug_tuple("VlaPtr").finish(),
            Self::Int(arg0) => f.debug_tuple("Int").field(arg0).finish(),
            Self::Float(arg0) => f.debug_tuple("Float").field(arg0).finish(),
            Self::Cast(arg0) => f.debug_tuple("Cast").field(arg0).finish(),
            Self::Cas { .. } => f.debug_struct("Cas").finish_non_exhaustive(),
            Self::Exch { .. } => f.debug_struct("Exch").finish_non_exhaustive(),
            Self::MemZero => write!(f, "MemZero"),
            Self::Asm(arg0) => f.debug_tuple("Asm").field(arg0).finish(),
        }
    }
}

#[derive(Debug)]
struct Node {
    kind: NodeFields,
    ty: Option<Type>,
    tok: Token, // Representative token
}

//
// type.rs
//

#[derive(Clone, Debug)]
enum TypeKind {
    Void,
    Bool,
    Char,
    Short,
    Int,
    Long,
    Float,
    Double,
    LDouble,
    Enum,
    Ptr {
        base: Box<Type>,
    },
    Func {
        return_ty: Box<Type>,
        params: Vec<Type>,
        is_variadic: bool,
        next: Option<Box<Type>>,
    },
    Array {
        base: Box<Type>,
        len: usize,
    },
    Vla {
        base: Box<Type>,
        vla_len: usize,
        vla_size: Option<usize>,
    }, // Variable-length array
    Struct {
        members: Vec<Member>,
        is_flexible: bool,
        is_packed: bool,
    },
    Union {
        members: Vec<Member>,
        is_flexible: bool,
    },
}

#[derive(Clone)]
struct OneOrManyTypes {
    ty: Vec<Type>,
}

#[derive(Clone)]
struct TypeDeclaration {
    name: Box<Token>,
    name_pos: Box<Token>,
}

#[derive(Clone)]
struct Type {
    kind: TypeKind,
    size: usize, // sizeof() value
    align: usize,
    is_unsigned: bool,
    is_atomic: bool,
    origin: Option<Rc<Type>>, // for type compatibility check
    decl: Option<TypeDeclaration>,
}

impl std::fmt::Debug for Type {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("Type")
            .field("kind", &self.kind)
            .field("size", &self.size)
            .field("align", &self.align)
            .field("is_unsigned", &self.is_unsigned)
            .finish_non_exhaustive()
    }
}

#[derive(Clone, Debug)]
struct Member {
    ty: Type,
    tok: Token,
    name: Token,
    idx: i32,
    align: i32,
    offset: i32,
    is_bitfield: bool,
    bit_offset: i32,
    bit_width: i32,
}

#[derive(Debug)]
pub struct Tokenize {
    current_file: Option<Rc<RefCell<File>>>, // Input file
    input_files: Vec<Rc<RefCell<File>>>,     // A list of all input files
}

impl Tokenize {
    #[instrument(ret)]
    pub fn new() -> Self {
        Tokenize {
            current_file: None,
            input_files: Vec::new(),
        }
    }
}

// type.c

const TY_VOID: Type = new_type(TypeKind::Void, 1, 1, false);
const TY_BOOL: Type = new_type(TypeKind::Bool, 1, 1, false);

const TY_CHAR: Type = new_type(TypeKind::Char, 1, 1, false);
const TY_SHORT: Type = new_type(TypeKind::Short, 2, 2, false);
const TY_INT: Type = new_type(TypeKind::Int, 4, 4, false);
const TY_LONG: Type = new_type(TypeKind::Long, 8, 8, false);

const TY_UCHAR: Type = new_type(TypeKind::Char, 1, 1, true);
const TY_USHORT: Type = new_type(TypeKind::Short, 2, 2, true);
const TY_UINT: Type = new_type(TypeKind::Int, 4, 4, true);
const TY_ULONG: Type = new_type(TypeKind::Long, 8, 8, true);

const TY_FLOAT: Type = new_type(TypeKind::Float, 4, 4, false);
const TY_DOUBLE: Type = new_type(TypeKind::Double, 8, 8, false);
const TY_LDOUBLE: Type = new_type(TypeKind::LDouble, 16, 16, false);

const fn new_type(kind: TypeKind, size: usize, align: usize, is_unsigned: bool) -> Type {
    return Type {
        kind,
        size,
        align,
        is_unsigned,
        is_atomic: false,
        origin: None,
        decl: None,
    };
}

#[instrument(ret)]
fn is_integer(ty: &Type) -> bool {
    match ty.kind {
        TypeKind::Bool | TypeKind::Char | TypeKind::Short | TypeKind::Int | TypeKind::Long => true,
        TypeKind::Enum => true,
        TypeKind::Void
        | TypeKind::Float
        | TypeKind::Double
        | TypeKind::LDouble
        | TypeKind::Ptr { .. }
        | TypeKind::Func { .. }
        | TypeKind::Array { .. }
        | TypeKind::Vla { .. }
        | TypeKind::Struct { .. }
        | TypeKind::Union { .. } => false,
    }
}

#[instrument(ret)]
fn is_flonum(ty: &Type) -> bool {
    match ty.kind {
        TypeKind::Float | TypeKind::Double | TypeKind::LDouble => true,
        TypeKind::Void
        | TypeKind::Bool
        | TypeKind::Char
        | TypeKind::Short
        | TypeKind::Int
        | TypeKind::Long
        | TypeKind::Enum
        | TypeKind::Ptr { .. }
        | TypeKind::Func { .. }
        | TypeKind::Array { .. }
        | TypeKind::Vla { .. }
        | TypeKind::Struct { .. }
        | TypeKind::Union { .. } => false,
    }
}

#[instrument(ret)]
fn is_numeric(ty: &Type) -> bool {
    return is_integer(ty) || is_flonum(ty);
}

#[instrument(ret)]
fn is_compatible(t1: &Type, t2: &Type) -> bool {
    if std::ptr::eq(t1, t2) {
        return true;
    }
    if let Some(o) = &t1.origin {
        return is_compatible(&o, t2);
    }
    if let Some(o) = &t2.origin {
        return is_compatible(t1, &o);
    }
    if std::mem::discriminant(&t1.kind) != std::mem::discriminant(&t2.kind) {
        return false;
    }
    match &t1.kind {
        TypeKind::Void => return false,
        TypeKind::Bool => return false,
        TypeKind::Char | TypeKind::Short | TypeKind::Int | TypeKind::Long => {
            return t1.is_unsigned == t2.is_unsigned;
        }
        TypeKind::Float | TypeKind::Double | TypeKind::LDouble => {
            return true;
        }
        TypeKind::Enum => return false,
        TypeKind::Ptr { base: base1 } => {
            // Arrays and VLAs decay to pointers for this check.
            let base2 = match &t2.kind {
                TypeKind::Ptr { base }
                | TypeKind::Array { base, .. }
                | TypeKind::Vla { base, .. } => base,
                TypeKind::Void
                | TypeKind::Bool
                | TypeKind::Char
                | TypeKind::Short
                | TypeKind::Int
                | TypeKind::Long
                | TypeKind::Float
                | TypeKind::Double
                | TypeKind::LDouble
                | TypeKind::Enum
                | TypeKind::Func { .. }
                | TypeKind::Struct { .. }
                | TypeKind::Union { .. } => {
                    return false;
                }
            };
            return is_compatible(base1, base2);
        }
        TypeKind::Func {
            return_ty: return_ty1,
            params: params1,
            is_variadic: is_variadic1,
            next: _,
        } => {
            if let TypeKind::Func {
                return_ty: return_ty2,
                params: params2,
                is_variadic: is_variadic2,
                next: _,
            } = &t2.kind
            {
                if !is_compatible(&return_ty1, &return_ty2) {
                    return false;
                }
                if is_variadic1 != is_variadic2 {
                    return false;
                }
                if params1.len() != params2.len() {
                    return false;
                }
                for (p1, p2) in params1.iter().zip(params2.iter()) {
                    if !is_compatible(&p1, &p2) {
                        return false;
                    }
                }
                return true;
            } else {
                return false;
            }
        }
        TypeKind::Array { .. } => {
            // TODO: chibicc also compares base types and array lengths here;
            // ported as "never compatible" for now.
            return false;
        }
        // TODO: can these types be compatible?
        TypeKind::Vla { .. } => return false,
        TypeKind::Struct { .. } => return false,
        TypeKind::Union { .. } => return false,
    }
}

#[instrument(ret)]
fn copy_type(ty: &Type) -> Type {
    let mut ret = ty.clone();
    ret.origin = Some(Rc::new(ty.clone()));
    return ret;
}

#[instrument(ret)]
fn pointer_to(base: &Type) -> Type {
    return new_type(
        TypeKind::Ptr {
            base: Box::new(base.clone()),
        },
        8,
        8,
        true,
    );
}

#[instrument(ret)]
fn func_type(return_ty: &Type) -> Type {
    return new_type(
        TypeKind::Func {
            return_ty: Box::new(return_ty.clone()),
            params: Vec::new(),
            is_variadic: false,
            next: None,
        },
        1,
        1,
        false,
    );
}

#[instrument(ret)]
fn array_of(base: &Type, len: usize) -> Type {
    return new_type(
        TypeKind::Array {
            base: Box::new(base.clone()),
            len: len,
        },
        base.size * len,
        base.align,
        false,
    );
}

#[instrument(ret)]
fn enum_type() -> Type {
    return new_type(TypeKind::Enum, 4, 4, false);
}

#[instrument(ret)]
fn struct_type() -> Type {
    return new_type(
        TypeKind::Struct {
            members: Vec::new(),
            is_flexible: false,
            is_packed: false,
        },
        0,
        1,
        false,
    );
}

#[instrument(ret)]
fn get_common_type(ty1: &Type, ty2: &Type) -> Type {
    match &ty1.kind {
        TypeKind::Ptr { base, .. } => {
            return pointer_to(&base);
        }
        TypeKind::Array { base, .. } => {
            return pointer_to(&base);
        }
        TypeKind::Vla { base, .. } => {
            return pointer_to(&base);
        }
        TypeKind::Void
        | TypeKind::Bool
        | TypeKind::Char
        | TypeKind::Short
        | TypeKind::Int
        | TypeKind::Long
        | TypeKind::Float
        | TypeKind::Double
        | TypeKind::LDouble
        | TypeKind::Enum
        | TypeKind::Func { .. }
        | TypeKind::Struct { .. }
        | TypeKind::Union { .. } => {}
    }

    if let TypeKind::Func { .. } = &ty1.kind {
        return pointer_to(ty1);
    }
    if let TypeKind::Func { .. } = &ty2.kind {
        return pointer_to(ty2);
    }

    if matches!(ty1.kind, TypeKind::LDouble) || matches!(ty2.kind, TypeKind::LDouble) {
        return TY_LDOUBLE;
    }
    if matches!(ty1.kind, TypeKind::Double) || matches!(ty2.kind, TypeKind::Double) {
        return TY_DOUBLE;
    }
    if matches!(ty1.kind, TypeKind::Float) || matches!(ty2.kind, TypeKind::Float) {
        return TY_FLOAT;
    }

    let mut ty1m: Type = ty1.clone();
    let mut ty2m: Type = ty2.clone();

    if ty1m.size < 4 {
        ty1m = TY_INT;
    }
    if ty2m.size < 4 {
        ty2m = TY_INT;
    }
    if ty1m.size != ty2m.size {
        if ty1m.size < ty2m.size {
            return ty2m;
        } else {
            return ty1m;
        }
    }

    if ty2m.is_unsigned {
        return ty2m;
    }

    return ty1m;
}

// add_type is implemented in the parse.c section

#[cfg(test)]
mod test_type {
    use super::{is_compatible, TY_CHAR, TY_LONG, TY_ULONG};

    #[test]
    fn test_is_compatible() {
        let cases = [((TY_CHAR, TY_CHAR), true), ((TY_ULONG, TY_LONG), false)];

        for ((t1, t2), e) in cases {
            let a = is_compatible(&t1, &t2);
            assert_eq!(a, e);
        }
    }
}

// unicode.c

#[instrument(ret, level = "debug")]
fn encode_utf8(buf: &mut [u8], c: u32) -> usize {
    #[instrument(ret)]
    fn one_zero_6_bits(c: u32) -> u8 {
        return 0b10000000 | (c & 0b111111) as u8;
    }

    if c <= 0x7F {
        buf[0] = c as u8;
        return 1;
    } else if c <= 0x7FF {
        buf[0] = 0b11000000 | (c >> 6) as u8;
        buf[1] = one_zero_6_bits(c);
        return 2;
    } else if c <= 0xFFFF {
        buf[0] = 0b11100000 | (c >> 12) as u8;
        buf[1] = one_zero_6_bits(c >> 6);
        buf[2] = one_zero_6_bits(c);
        return 3;
    } else if c <= 0x10FFFF {
        buf[0] = 0b11110000 | (c >> 18) as u8;
        buf[1] = one_zero_6_bits(c >> 12);
        buf[2] = one_zero_6_bits(c >> 6);
        buf[3] = one_zero_6_bits(c);
        return 4;
    } else {
        // TODO: error?
        todo!();
    }
}

#[instrument(ret, level = "debug")]
fn decode_utf8(b: &[u8]) -> Result<(u32, &[u8]), &str> {
    // TODO: handle out of bounds errors?
    if b[0] <= 0x7F {
        return Ok((b[0] as u32, &b[1..]));
    } else {
        let (first_bits, len) = {
            if b[0] >= 0b11110000 {
                (b[0] & 0b111, 4)
            } else if b[0] >= 0b11100000 {
                (b[0] & 0b1111, 3)
            } else if b[0] >= 0b11000000 {
                (b[0] & 0b11111, 2)
            } else {
                // TODO: bubble up to display the token which triggered the error?
                return Err("invalid UTF-8 sequence");
            }
        };

        let mut c = first_bits as u32;
        for i in 1..len {
            // Continuation bytes must be 0b10xxxxxx; reject both ASCII and
            // lead bytes here.
            if (b[i] & 0b11000000) != 0b10000000 {
                // TODO: bubble up to display the token which triggered the error?
                return Err("invalid UTF-8 sequence");
            }
            c = (c << 6) | (b[i] & 0b111111) as u32
        }

        Ok((c, &b[len..]))
    }
}

#[cfg(test)]
mod test_utf8 {
    use super::{decode_utf8, encode_utf8};

    #[test]
    fn utf8() {
        let points: Vec<u32> = vec![0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF];

        for p in points {
            let mut buffer = [0u8; 4];
            encode_utf8(&mut buffer, p);
            let r = decode_utf8(&buffer);
            assert!(r.is_ok());
            let (point_result, new) = r.unwrap();
            assert_eq!(point_result, p);
            assert!(new.len() < 4);
        }
    }
}

#[instrument(ret, level = "debug")]
fn in_range(range: &[(u32, u32)], c: u32) -> bool {
    for (low, high) in range {
        if *low <= c && c <= *high {
            return true;
        }
    }
    return false;
}

// [https://www.sigbus.info/n1570#D] C11 allows not only ASCII
// but some multibyte characters in certain Unicode ranges to be used in
// an identifier.
//
// This function returns true if a given character is acceptable as the
// first character of an identifier.
#[instrument(ret, level = "debug")]
fn is_ident1(c: u32) -> bool {
    let range = [
        ('_' as u32, '_' as u32),
        ('a' as u32, 'z' as u32),
        ('A' as u32, 'Z' as u32),
        ('$' as u32, '$' as u32),
        (0x00A8, 0x00A8),
        (0x00AA, 0x00AA),
        (0x00AD, 0x00AD),
        (0x00AF, 0x00AF),
        (0x00B2, 0x00B5),
        (0x00B7, 0x00BA),
        (0x00BC, 0x00BE),
        (0x00C0, 0x00D6),
        (0x00D8, 0x00F6),
        (0x00F8, 0x00FF),
        (0x0100, 0x02FF),
        (0x0370, 0x167F),
        (0x1681, 0x180D),
        (0x180F, 0x1DBF),
        (0x1E00, 0x1FFF),
        (0x200B, 0x200D),
        (0x202A, 0x202E),
        (0x203F, 0x2040),
        (0x2054, 0x2054),
        (0x2060, 0x206F),
        (0x2070, 0x20CF),
        (0x2100, 0x218F),
        (0x2460, 0x24FF),
        (0x2776, 0x2793),
        (0x2C00, 0x2DFF),
        (0x2E80, 0x2FFF),
        (0x3004, 0x3007),
        (0x3021, 0x302F),
        (0x3031, 0x303F),
        (0x3040, 0xD7FF),
        (0xF900, 0xFD3D),
        (0xFD40, 0xFDCF),
        (0xFDF0, 0xFE1F),
        (0xFE30, 0xFE44),
        (0xFE47, 0xFFFD),
        (0x10000, 0x1FFFD),
        (0x20000, 0x2FFFD),
        (0x30000, 0x3FFFD),
        (0x40000, 0x4FFFD),
        (0x50000, 0x5FFFD),
        (0x60000, 0x6FFFD),
        (0x70000, 0x7FFFD),
        (0x80000, 0x8FFFD),
        (0x90000, 0x9FFFD),
        (0xA0000, 0xAFFFD),
        (0xB0000, 0xBFFFD),
        (0xC0000, 0xCFFFD),
        (0xD0000, 0xDFFFD),
        (0xE0000, 0xEFFFD),
    ];

    return in_range(&range, c);
}

// Return true if a given character is acceptable as a non-first
// character of an identifier.
#[instrument(ret, level = "debug")]
fn is_ident2(c: u32) -> bool {
    let range = [
        ('0' as u32, '9' as u32),
        ('$' as u32, '$' as u32),
        (0x0300, 0x036F),
        (0x1DC0, 0x1DFF),
        (0x20D0, 0x20FF),
        (0xFE20, 0xFE2F),
    ];
    return is_ident1(c) || in_range(&range, c);
}

// Returns the number of columns needed to display a given
// character in a fixed-width font.
//
// Based on https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c.
#[instrument(ret)]
fn char_width(c: u32) -> usize {
    let range1 = [
        (0x0000, 0x001F),
        (0x007f, 0x00a0),
        (0x0300, 0x036F),
        (0x0483, 0x0486),
        (0x0488, 0x0489),
        (0x0591, 0x05BD),
        (0x05BF, 0x05BF),
        (0x05C1, 0x05C2),
        (0x05C4, 0x05C5),
        (0x05C7, 0x05C7),
        (0x0600, 0x0603),
        (0x0610, 0x0615),
        (0x064B, 0x065E),
        (0x0670, 0x0670),
        (0x06D6, 0x06E4),
        (0x06E7, 0x06E8),
        (0x06EA, 0x06ED),
        (0x070F, 0x070F),
        (0x0711, 0x0711),
        (0x0730, 0x074A),
        (0x07A6, 0x07B0),
        (0x07EB, 0x07F3),
        (0x0901, 0x0902),
        (0x093C, 0x093C),
        (0x0941, 0x0948),
        (0x094D, 0x094D),
        (0x0951, 0x0954),
        (0x0962, 0x0963),
        (0x0981, 0x0981),
        (0x09BC, 0x09BC),
        (0x09C1, 0x09C4),
        (0x09CD, 0x09CD),
        (0x09E2, 0x09E3),
        (0x0A01, 0x0A02),
        (0x0A3C, 0x0A3C),
        (0x0A41, 0x0A42),
        (0x0A47, 0x0A48),
        (0x0A4B, 0x0A4D),
        (0x0A70, 0x0A71),
        (0x0A81, 0x0A82),
        (0x0ABC, 0x0ABC),
        (0x0AC1, 0x0AC5),
        (0x0AC7, 0x0AC8),
        (0x0ACD, 0x0ACD),
        (0x0AE2, 0x0AE3),
        (0x0B01, 0x0B01),
        (0x0B3C, 0x0B3C),
        (0x0B3F, 0x0B3F),
        (0x0B41, 0x0B43),
        (0x0B4D, 0x0B4D),
        (0x0B56, 0x0B56),
        (0x0B82, 0x0B82),
        (0x0BC0, 0x0BC0),
        (0x0BCD, 0x0BCD),
        (0x0C3E, 0x0C40),
        (0x0C46, 0x0C48),
        (0x0C4A, 0x0C4D),
        (0x0C55, 0x0C56),
        (0x0CBC, 0x0CBC),
        (0x0CBF, 0x0CBF),
        (0x0CC6, 0x0CC6),
        (0x0CCC, 0x0CCD),
        (0x0CE2, 0x0CE3),
        (0x0D41, 0x0D43),
        (0x0D4D, 0x0D4D),
        (0x0DCA, 0x0DCA),
        (0x0DD2, 0x0DD4),
        (0x0DD6, 0x0DD6),
        (0x0E31, 0x0E31),
        (0x0E34, 0x0E3A),
        (0x0E47, 0x0E4E),
        (0x0EB1, 0x0EB1),
        (0x0EB4, 0x0EB9),
        (0x0EBB, 0x0EBC),
        (0x0EC8, 0x0ECD),
        (0x0F18, 0x0F19),
        (0x0F35, 0x0F35),
        (0x0F37, 0x0F37),
        (0x0F39, 0x0F39),
        (0x0F71, 0x0F7E),
        (0x0F80, 0x0F84),
        (0x0F86, 0x0F87),
        (0x0F90, 0x0F97),
        (0x0F99, 0x0FBC),
        (0x0FC6, 0x0FC6),
        (0x102D, 0x1030),
        (0x1032, 0x1032),
        (0x1036, 0x1037),
        (0x1039, 0x1039),
        (0x1058, 0x1059),
        (0x1160, 0x11FF),
        (0x135F, 0x135F),
        (0x1712, 0x1714),
        (0x1732, 0x1734),
        (0x1752, 0x1753),
        (0x1772, 0x1773),
        (0x17B4, 0x17B5),
        (0x17B7, 0x17BD),
        (0x17C6, 0x17C6),
        (0x17C9, 0x17D3),
        (0x17DD, 0x17DD),
        (0x180B, 0x180D),
        (0x18A9, 0x18A9),
        (0x1920, 0x1922),
        (0x1927, 0x1928),
        (0x1932, 0x1932),
        (0x1939, 0x193B),
        (0x1A17, 0x1A18),
        (0x1B00, 0x1B03),
        (0x1B34, 0x1B34),
        (0x1B36, 0x1B3A),
        (0x1B3C, 0x1B3C),
        (0x1B42, 0x1B42),
        (0x1B6B, 0x1B73),
        (0x1DC0, 0x1DCA),
        (0x1DFE, 0x1DFF),
        (0x200B, 0x200F),
        (0x202A, 0x202E),
        (0x2060, 0x2063),
        (0x206A, 0x206F),
        (0x20D0, 0x20EF),
        (0x302A, 0x302F),
        (0x3099, 0x309A),
        (0xA806, 0xA806),
        (0xA80B, 0xA80B),
        (0xA825, 0xA826),
        (0xFB1E, 0xFB1E),
        (0xFE00, 0xFE0F),
        (0xFE20, 0xFE23),
        (0xFEFF, 0xFEFF),
        (0xFFF9, 0xFFFB),
        (0x10A01, 0x10A03),
        (0x10A05, 0x10A06),
        (0x10A0C, 0x10A0F),
        (0x10A38, 0x10A3A),
        (0x10A3F, 0x10A3F),
        (0x1D167, 0x1D169),
        (0x1D173, 0x1D182),
        (0x1D185, 0x1D18B),
        (0x1D1AA, 0x1D1AD),
        (0x1D242, 0x1D244),
        (0xE0001, 0xE0001),
        (0xE0020, 0xE007F),
        (0xE0100, 0xE01EF),
    ];
    if in_range(&range1, c) {
        return 0;
    }

    let range2 = [
        (0x1100, 0x115F),
        (0x2329, 0x2329),
        (0x232A, 0x232A),
        (0x2E80, 0x303E),
        (0x3040, 0xA4CF),
        (0xAC00, 0xD7A3),
        (0xF900, 0xFAFF),
        (0xFE10, 0xFE19),
        (0xFE30, 0xFE6F),
        (0xFF00, 0xFF60),
        (0xFFE0, 0xFFE6),
        (0x1F000, 0x1F644),
        (0x20000, 0x2FFFD),
        (0x30000, 0x3FFFD),
    ];

    if in_range(&range2, c) {
        return 2;
    }

    // All other characters occupy a single column.
    return 1;
}

#[instrument(ret)]
fn display_width(p: &str) -> usize {
    let mut w = 0;
    for c in p.chars() {
        w += char_width(c as u32);
    }
    return w;
}
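The width logic above leans on an `in_range` helper defined elsewhere in this file. A self-contained sketch of the same idea (hypothetical `_sketch` names, heavily abridged tables, linear scan for clarity):

```rust
// Sketch of a range-table lookup: linear scan over inclusive (lo, hi)
// pairs. The tables above are sorted, so a binary search over the
// upper bounds would also work.
fn in_range_sketch(ranges: &[(u32, u32)], c: u32) -> bool {
    ranges.iter().any(|&(lo, hi)| lo <= c && c <= hi)
}

// Width of one character: 0 for combining/control ranges, 2 for wide
// (CJK) ranges, 1 otherwise. The real tables are much longer.
fn char_width_sketch(c: u32) -> usize {
    let zero = [(0x0300, 0x036F)]; // combining diacritics (abridged)
    let wide = [(0x3040, 0xA4CF)]; // CJK and friends (abridged)
    if in_range_sketch(&zero, c) {
        0
    } else if in_range_sketch(&wide, c) {
        2
    } else {
        1
    }
}
```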

//
// tokenize.c
//

#[instrument(ret)]
fn error(filename: &str, line_no: i32, loc: &Location, msg: &str) {
    let contents = loc.contents.as_ref();
    let line_idx = {
        let mut idx = loc.idx;
        while idx > 0 && contents[idx - 1] != b'\n' {
            idx -= 1;
        }
        idx
    };
    let end_idx = {
        let mut idx = loc.idx;
        while idx < contents.len() && contents[idx] != b'\n' {
            idx += 1;
        }
        idx
    };

    // Print out the line.
    let line = core::str::from_utf8(&contents[line_idx..end_idx]).unwrap();
    let header = format!("{}:{}: ", filename, line_no);
    let indent = header.len();
    eprintln!("{}{}", header, line);

    // Show the error message below the token.
    let spaces = indent + {
        let line_to_loc = core::str::from_utf8(&contents[line_idx..loc.idx]).unwrap();
        display_width(line_to_loc)
    };
    eprintln!("{}^ {}", " ".repeat(spaces), msg);
}

impl Tokenize {
    #[instrument(ret)]
    fn error(&self, loc: &Location, msg: &str) {
        let line_no = {
            let mut n = 1;
            for (i, b) in loc.contents.iter().enumerate() {
                if i == loc.idx {
                    break;
                }
                if *b == b'\n' {
                    n += 1;
                }
            }
            n
        };

        error(
            self.current_file.as_ref().unwrap().borrow().name.as_str(),
            line_no,
            &loc,
            msg,
        );
        panic!(); // TODO: replace with process exit?
    }
}

impl Token {
    #[instrument(ret)]
    fn error(&self, msg: &str) {
        let file = &self.file;
        error(&file.borrow().name, self.line_no, &self.loc, msg);
        panic!(); // TODO: replace with process exit?
    }
    #[instrument(ret)]
    fn warn(&self, msg: &str) {
        let file = &self.file;
        error(&file.borrow().name, self.line_no, &self.loc, msg);
    }
}

#[cfg(test)]
mod test_errors {

    #[test]
    fn errors() {
        // TODO:
    }
}

impl Token {
    #[instrument(ret, level = Level::DEBUG)]
    pub fn new(
        kind: TokenKind,
        current_file: Rc<RefCell<File>>,
        loc: Location,
        at_bol: bool,
        has_space: bool,
    ) -> Self {
        Token {
            kind,
            loc,
            file: current_file,
            filename: None,
            line_no: -1, // TODO: uninitialized?
            line_delta: 0,
            at_bol: at_bol,
            has_space: has_space,
            hideset: Vec::new(),
            origin: None,
            other: TokenFields::None,
        }
    }
}

#[instrument(ret, level = "debug")]
fn read_ident(start: &[u8]) -> usize {
    let (c, mut next) = decode_utf8(start).unwrap(); // TODO: unwrap
    if !is_ident1(c) {
        return 0;
    }

    while next.len() > 0 {
        let (c, nextnext) = decode_utf8(next).unwrap(); // TODO:
        if !is_ident2(c) {
            break;
        }
        next = nextnext;
    }

    return start.len() - next.len();
}

#[cfg(test)]
mod test_read_ident {
    use super::read_ident;

    #[test]
    fn test_read_ident() {
        let idents = [("asdfadsf1", 9), ("123adsfas", 0)];
        for (s, elen) in idents {
            let alen = read_ident(s.as_bytes());
            assert_eq!(elen, alen);
        }
    }
}

#[instrument(ret)]
fn from_hex(c: u8) -> u8 {
    if b'0' <= c && c <= b'9' {
        return c - b'0';
    } else if b'a' <= c && c <= b'f' {
        return c - b'a' + 10;
    } else {
        // Callers check isxdigit first, so this must be 'A'..='F'.
        return c - b'A' + 10;
    }
}

#[instrument(ret, level = "debug")]
fn read_punct(p: &[u8]) -> usize {
    let kw = [
        "<<=", ">>=", "...", "==", "!=", "<=", ">=", "->", "+=", "-=", "*=", "/=", "++", "--",
        "%=", "&=", "|=", "^=", "&&", "||", "<<", ">>", "##",
    ];
    for w in kw {
        if p.starts_with(w.as_bytes()) {
            return w.len();
        }
    }

    return unsafe {
        // p[0] is a u8; widen to c_int for libc::ispunct.
        if libc::ispunct(p[0] as libc::c_int) > 0 {
            1
        } else {
            0
        }
    };
}
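One detail worth noting in `read_punct`: the table is ordered longest-first, so `<<=` wins over `<<`. A minimal sketch (hypothetical `match_punct` helper, not the file's API) of why the order matters:

```rust
// First-match lookup over a punctuator table. Because the scan stops
// at the first prefix match, longer operators must come before their
// own prefixes ("<<=" before "<<"), or the longer form never matches.
fn match_punct(p: &[u8], table: &[&str]) -> usize {
    table
        .iter()
        .find(|w| p.starts_with(w.as_bytes()))
        .map_or(0, |w| w.len())
}
```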

impl Token {
    #[instrument(ret, level = Level::DEBUG)]
    fn is_keyword(&self) -> bool {
        // TODO: build this set once (e.g. with std::sync::OnceLock) instead of per call.
        let set: HashSet<&str> = HashSet::from([
            "return",
            "if",
            "else",
            "for",
            "while",
            "int",
            "sizeof",
            "char",
            "struct",
            "union",
            "short",
            "long",
            "void",
            "typedef",
            "_Bool",
            "enum",
            "static",
            "goto",
            "break",
            "continue",
            "switch",
            "case",
            "default",
            "extern",
            "_Alignof",
            "_Alignas",
            "do",
            "signed",
            "unsigned",
            "const",
            "volatile",
            "auto",
            "register",
            "restrict",
            "__restrict",
            "__restrict__",
            "_Noreturn",
            "float",
            "double",
            "typeof",
            "asm",
            "_Thread_local",
            "__thread",
            "_Atomic",
            "__attribute__",
        ]);

        return set.contains(self.loc.as_str());
    }
}

impl Tokenize {
    // Escape sequences are defined using themselves here.
    // '\n' is implemented using '\n'. This tautological definition
    // works because the compiler that compiles our compiler knows what
    // '\n' actually is. In other words, we "inherit" the ASCII code of '\n'
    // from the compiler that compiles our compiler, so we don't have to teach
    // the actual code here.
    //
    // This fact has huge implications not only for the correctness
    // of the compiler but also for the security of the generated code.
    // For more info, read "Reflections on Trusting Trust" by Ken Thompson
    // https://github.com/rui314/chibicc/wiki/thompson1984.pdf
    #[instrument(ret)]
    fn read_escaped_char<'a>(&'a self, p: &mut Location) -> u32 {
        if b'0' <= p[0] && p[0] <= b'7' {
            // Read an octal number.
            let mut c = (p[0] - b'0') as u32;
            let mut i: usize = 1;
            p.next();
            while i < 3 && p.len() > 0 && b'0' <= p[0] && p[0] <= b'7' {
                c <<= 3;
                c += (p[0] - b'0') as u32;
                p.next();
                i += 1;
            }
            return c;
        } else if p[0] == b'x' {
            // Read a hexadecimal number.
            p.next();
            if unsafe { libc::isxdigit(p[0] as i32) == 0 } {
                // TODO: preemptively replace with Location?
                self.error(p, "invalid hex escape sequence");
            }
            let mut c: u32 = 0;
            while p.len() > 0 && unsafe { libc::isxdigit(p[0] as i32) > 0 } {
                c <<= 4;
                c += from_hex(p[0]) as u32;
                p.next();
            }
            return c;
        } else {
            let c = match p[0] {
                b'a' => 7 as u32,
                b'b' => 8 as u32,
                b't' => b'\t' as u32,
                b'n' => b'\n' as u32,
                b'v' => 11 as u32,
                b'f' => 12 as u32,
                b'r' => b'\r' as u32,
                b'e' => 27,
                _ => p[0] as u32,
            };
            p.next();

            return c;
        }
    }

    #[instrument(ret)]
    fn string_literal_end(&self, loc: &Location) -> Location {
        let mut end = loc.clone();
        while let Some(c) = end.next() {
            if c == b'"' {
                break;
            }
            if c == b'\n' || c == b'\0' {
                self.error(loc, "unclosed string literal");
            }
            if c == b'\\' {
                end.next();
            }
        }
        end.idx -= 1;
        assert!(end[0] == b'"');
        return end;
    }

    #[instrument(ret)]
    fn read_string_literal(
        &self,
        start: &Location,
        quote: &Location,
        at_bol: bool,
        has_space: bool,
    ) -> Token {
        let mut q = quote.clone();
        q.next();
        let end = self.string_literal_end(&q);
        q.end_idx = end.idx;
        let mut buf: Vec<u8> = Vec::with_capacity(end.idx - quote.idx);

        while let Some(c) = q.next() {
            if c == b'\\' {
                let ec: u32 = self.read_escaped_char(&mut q);
                buf.push(ec as u8); // TODO: unsafe cast?
            } else {
                buf.push(c);
            }
        }

        let mut tok = Token::new(
            TokenKind::Str,
            self.current_file.clone().unwrap(),
            Location {
                idx: start.idx,
                end_idx: end.idx + 1,
                contents: end.contents,
            },
            at_bol,
            has_space,
        );

        tok.other = TokenFields::Str(array_of(&TY_CHAR, buf.len() + 1), buf.into_boxed_slice());

        return tok;
    }

    // Read a UTF-8-encoded string literal and transcode it in UTF-16.
    //
    // UTF-16 is yet another variable-width encoding for Unicode. Code
    // points smaller than U+10000 are encoded in 2 bytes. Code points
    // equal to or larger than that are encoded in 4 bytes. Each 2-byte
    // unit in the 4-byte sequence is called a "surrogate", and the
    // 4-byte sequence itself is called a "surrogate pair".
    #[instrument(ret)]
    fn read_utf16_string_literal(
        &self,
        start: &Location,
        quote: &Location,
        at_bol: bool,
        has_space: bool,
    ) -> Token {
        let mut q = quote.clone();
        q.next();
        let end = self.string_literal_end(&q);
        q.end_idx = end.idx;

        let mut buf = Vec::with_capacity(end.idx - start.idx);

        while q.idx < end.idx {
            if q[0] == b'\\' {
                q.next();
                let ec = self.read_escaped_char(&mut q);
                buf.push(ec as u16); // TODO: unsafe cast?
            } else {
                let p = q.as_bytes();
                let (cd, next_p) = decode_utf8(p).unwrap();
                q.idx += p.len() - next_p.len();
                if cd < 0x10000 {
                    // Encode a code point in 2 bytes.
                    buf.push(cd as u16);
                } else {
                    // Encode a code point in 4 bytes.
                    let cs = cd - 0x10000;
                    let b1 = 0xd800 + ((cs >> 10) & 0x3ff);
                    let b2 = 0xdc00 + (cs & 0x3ff);
                    buf.push(b1 as u16);
                    buf.push(b2 as u16);
                }
            }
        }

        let mut tok = Token::new(
            TokenKind::Str,
            self.current_file.clone().unwrap(),
            Location {
                idx: start.idx,
                end_idx: end.idx + 1,
                contents: q.contents,
            },
            at_bol,
            has_space,
        );
        tok.other = TokenFields::Str(array_of(&TY_USHORT, buf.len() + 1), unsafe {
            std::mem::transmute(buf.into_boxed_slice())
        });
        return tok;
    }

    // Read a UTF-8-encoded string literal and transcode it in UTF-32.
    //
    // UTF-32 is a fixed-width encoding for Unicode. Each code point is
    // encoded in 4 bytes.
    #[instrument(ret)]
    fn read_utf32_string_literal(
        &self,
        start: &Location,
        quote: &Location,
        ty: &Type,
        at_bol: bool,
        has_space: bool,
    ) -> Token {
        let mut q = quote.clone();
        q.next();
        let end = self.string_literal_end(&q);
        q.end_idx = end.idx;
        let mut buf: Vec<u32> = Vec::with_capacity(end.idx - start.idx); // Why does the original use quote.idx here instead of start.idx?

        while q.idx < end.idx {
            if q[0] == b'\\' {
                q.next();
                let ec = self.read_escaped_char(&mut q);
                buf.push(ec);
            } else {
                let p = q.as_bytes();
                let (dc, next_p) = decode_utf8(p).unwrap();
                q.idx += p.len() - next_p.len();
                buf.push(dc);
            }
        }

        let mut tok = Token::new(
            TokenKind::Str,
            self.current_file.clone().unwrap(),
            Location {
                idx: start.idx,
                end_idx: end.idx + 1,
                contents: q.contents,
            },
            at_bol,
            has_space,
        );
        tok.other = TokenFields::Str(array_of(ty, buf.len() + 1), unsafe {
            std::mem::transmute(buf.into_boxed_slice())
        });
        return tok;
    }

    #[instrument(ret)]
    fn read_char_literal(
        &self,
        start: &Location,
        quote: &Location,
        ty: &Type,
        at_bol: bool,
        has_space: bool,
    ) -> Token {
        let mut q = quote.clone();
        q.next();
        let first_literal_byte = q[0];
        if first_literal_byte == b'\0' {
            self.error(start, "unclosed char literal");
        }
        let c32 = {
            if first_literal_byte == b'\\' {
                q.next();
                self.read_escaped_char(&mut q)
            } else {
                let p = q.as_bytes();
                let (c32, nextp) = decode_utf8(p).unwrap();
                q.idx += p.len() - nextp.len();
                c32
            }
        };

        if q.find(|c| *c == b'\'').is_none() {
            self.error(start, "unclosed char literal");
        }
        let end = q;

        let mut tok = Token::new(
            TokenKind::Num,
            self.current_file.clone().unwrap(),
            Location {
                idx: start.idx,
                end_idx: end.idx, // end.idx is one past the closing quote
                contents: end.contents,
            },
            at_bol,
            has_space,
        );
        tok.other = TokenFields::Int(ty.clone(), c32 as i64);

        return tok;
    }
}
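The surrogate-pair arithmetic in `read_utf16_string_literal` can be checked in isolation. A self-contained sketch (hypothetical `encode_utf16_sketch`, mirroring the branch above, not the file's API):

```rust
// Encode one Unicode code point as UTF-16 code units. Points below
// U+10000 fit in one unit; larger points are offset by 0x10000 and
// split into a high surrogate (0xD800 + top 10 bits) and a low
// surrogate (0xDC00 + bottom 10 bits).
fn encode_utf16_sketch(cd: u32) -> Vec<u16> {
    if cd < 0x10000 {
        vec![cd as u16]
    } else {
        let cs = cd - 0x10000;
        vec![
            (0xD800 + ((cs >> 10) & 0x3FF)) as u16,
            (0xDC00 + (cs & 0x3FF)) as u16,
        ]
    }
}
```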

impl Token {
    #[instrument(ret)]
    fn convert_pp_int(&mut self) -> bool {
        // TODO: functionality differs from the original, because p stops at
        // loc.end_idx while the original code may look at bytes past loc.end_idx.
        let mut p = self.loc.as_bytes();
        // Read a binary, octal, decimal, or hexadecimal number.
        let mut base = 10;
        if p.len() >= 3
            && p[..2].eq_ignore_ascii_case("0x".as_bytes())
            && unsafe { isxdigit(p[2] as i32) > 0 }
        {
            p = &p[2..];
            base = 16;
        } else if p.len() >= 3
            && p[..2].eq_ignore_ascii_case("0b".as_bytes())
            && (p[2] == b'0' || p[2] == b'1')
        {
            p = &p[2..];
            base = 2;
        } else if p[0] == b'0' {
            base = 8;
        }

        let mut pint = &p[..];
        let mut flag = false;
        for i in 0..p.len() {
            if p[i].eq_ignore_ascii_case(&b'l') || p[i].eq_ignore_ascii_case(&b'u') {
                pint = &p[..i];
                p = &p[i..];
                flag = true;
                break;
            }
        }
        if !flag {
            p = &p[p.len()..];
        }

        let (val, l, u) = match u64::from_str_radix(core::str::from_utf8(&pint).unwrap(), base) {
            Ok(val_unsigned) => {
                let val = val_unsigned as i64;
                if p.starts_with("LLU".as_bytes())
                    || p.starts_with("LLu".as_bytes())
                    || p.starts_with("llU".as_bytes())
                    || p.starts_with("llu".as_bytes())
                    || p.starts_with("ULL".as_bytes())
                    || p.starts_with("Ull".as_bytes())
                    || p.starts_with("uLL".as_bytes())
                    || p.starts_with("ull".as_bytes())
                {
                    p = &p[3..];
                    (val, true, true)
                } else if p.len() >= 2
                    && (p[..2].eq_ignore_ascii_case("lu".as_bytes())
                        || p[..2].eq_ignore_ascii_case("ul".as_bytes()))
                {
                    p = &p[2..];
                    (val, true, true)
                } else if p.starts_with("LL".as_bytes()) || p.starts_with("ll".as_bytes()) {
                    p = &p[2..];
                    (val, true, false)
                } else if p.len() > 0 && p[0].eq_ignore_ascii_case(&b'l') {
                    p = &p[1..];
                    (val, true, false)
                } else if p.len() > 0 && p[0].eq_ignore_ascii_case(&b'u') {
                    p = &p[1..];
                    (val, false, true)
                } else {
                    (val, false, false)
                }
            }
            Err(_) => {
                return false;
            }
        };

        if p.len() != 0 {
            return false;
        }

        let ty: Type = {
            if base == 10 {
                if l && u {
                    TY_ULONG
                } else if l {
                    TY_LONG
                } else if u {
                    if val >> 32 != 0 {
                        TY_ULONG
                    } else {
                        TY_UINT
                    }
                } else {
                    if val >> 31 != 0 {
                        TY_LONG
                    } else {
                        TY_INT
                    }
                }
            } else {
                if l && u {
                    TY_ULONG
                } else if l {
                    if val >> 63 != 0 {
                        TY_ULONG
                    } else {
                        TY_LONG
                    }
                } else if u {
                    if val >> 32 != 0 {
                        TY_ULONG
                    } else {
                        TY_UINT
                    }
                } else if val >> 63 != 0 {
                    TY_ULONG
                } else if val >> 32 != 0 {
                    TY_LONG
                } else if val >> 31 != 0 {
                    TY_UINT
                } else {
                    TY_INT
                }
            }
        };
        self.kind = TokenKind::Num;
        self.other = TokenFields::Int(ty, val);
        return true;
    }

    // The definition of the numeric literal at the preprocessing stage
    // is more relaxed than the definition of that at the later stages.
    // In order to handle that, a numeric literal is tokenized as a
    // "pp-number" token first and then converted to a regular number
    // token after preprocessing.
    //
    // This function converts a pp-number token to a regular number token.
    #[instrument(ret)]
    fn convert_pp_number(&mut self) {
        if self.convert_pp_int() {
            return;
        }

        let (val, bytes_read) = unsafe {
            let s = CString::new(self.loc.as_bytes()).unwrap();
            let start = s.as_ptr();
            let end: Cell<*mut c_char> = Cell::new(std::ptr::null_mut());
            let val = strtod(s.as_ptr(), end.as_ptr());
            let len = end.get().byte_offset_from(start);
            (val, len as usize)
        };
        let mut end = &self.loc.as_bytes()[bytes_read..];

        let (ty, val) = {
            if end.len() > 0 && end[0].eq_ignore_ascii_case(&b'f') {
                end = &end[1..];
                (TY_FLOAT, val)
            } else if end.len() > 0 && end[0].eq_ignore_ascii_case(&b'l') {
                end = &end[1..];
                (TY_LDOUBLE, val)
            } else {
                (TY_DOUBLE, val)
            }
        };

        if end.len() != 0 {
            self.error("invalid numeric constant");
        }

        self.kind = TokenKind::Num;
        self.other = TokenFields::Float(ty, val);
    }
}
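The base-prefix detection in `convert_pp_int` can be sketched standalone. A hedged illustration (hypothetical `parse_pp_int_sketch`; suffix handling for u/l/ll is omitted):

```rust
// Sketch of the base detection above: strip a "0x"/"0b" prefix (or
// treat a leading "0" as octal) and hand the remaining digits to
// u64::from_str_radix, as convert_pp_int does.
fn parse_pp_int_sketch(s: &str) -> Option<u64> {
    let (digits, base) = if let Some(r) = s.strip_prefix("0x").or_else(|| s.strip_prefix("0X")) {
        (r, 16)
    } else if let Some(r) = s.strip_prefix("0b").or_else(|| s.strip_prefix("0B")) {
        (r, 2)
    } else if s.len() > 1 && s.starts_with('0') {
        (&s[1..], 8)
    } else {
        (s, 10)
    };
    u64::from_str_radix(digits, base).ok()
}
```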

impl Tokenize {
    // We take a vector of tokens instead of a linked list.
    #[instrument(skip_all)]
    fn convert_pp_tokens(tokens: &mut Vec<Token>) {
        for t in tokens {
            if t.is_keyword() {
                t.kind = TokenKind::Keyword;
            } else if t.kind == TokenKind::PPNum {
                t.convert_pp_number();
            }
        }
    }

    // Initialize line info for all tokens.
    //
    // Note: we take a vector, and assume the tokens are sorted.
    #[instrument(skip_all)]
    fn add_line_number(&self, tokens: &mut Vec<Token>) {
        let p: &[u8] = &self.current_file.as_ref().unwrap().borrow().contents;
        let mut n = 1;
        let mut token_idx = 0;
        for (i, c) in p.into_iter().enumerate() {
            if token_idx == tokens.len() {
                // assert_eq!(*c, b'\n');
                continue;
            }
            if i == tokens[token_idx].loc.idx {
                tokens[token_idx].line_no = n;
                token_idx += 1;
            }
            if *c == b'\n' {
                n += 1;
            }
        }
    }
}
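`add_line_number` assigns line numbers to all tokens in one pass over the file. The per-offset computation it amortizes (the same one `Tokenize::error` does inline) is just a newline count; a self-contained sketch with a hypothetical name:

```rust
// 1-based line number of a byte offset: one plus the number of
// newlines strictly before it. add_line_number spreads this over a
// single scan instead of rescanning the file for every token.
fn line_no_sketch(src: &[u8], idx: usize) -> usize {
    1 + src[..idx].iter().filter(|&&b| b == b'\n').count()
}
```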

#[cfg(test)]
mod test_read {
    use std::{cell::RefCell, rc::Rc};

    use super::{File, Location, Tokenize};

    #[test]
    fn test_read_escaped_character() {
        let cases = [
            ("044", 36),  // octal
            ("xFF", 255), // hex
            ("\n", '\n' as u32),
            ("u", b'u' as u32),
        ];

        let t = fake_tokenize();

        for (s, e) in cases {
            let mut loc = Location {
                idx: 0,
                end_idx: s.len(),
                contents: Rc::new(s.into()),
            };

            let a = t.read_escaped_char(&mut loc);
            assert_eq!(loc.idx, loc.contents.len());
            assert_eq!(a, e);
        }
    }

    fn fake_tokenize() -> Tokenize {
        let fake_file = File {
            name: "fake".into(),
            file_no: -1,
            contents: Rc::new("fake".into()),
            display_name: "fake".into(),
            line_delta: 0,
        };

        let t = Tokenize {
            current_file: Some(Rc::new(RefCell::new(fake_file))),
            input_files: Vec::new(),
        };
        t
    }
}

impl Tokenize {
    #[instrument(ret)]
    fn tokenize_string_literal(
        &self,
        tok: &Token,
        basety: &Type,
        at_bol: bool,
        has_space: bool,
    ) -> Token {
        let t = if basety.size == 2 {
            self.read_utf16_string_literal(&tok.loc, &tok.loc, at_bol, has_space)
        } else {
            self.read_utf32_string_literal(&tok.loc, &tok.loc, basety, at_bol, has_space)
        };
        return t;
    }

    #[instrument(ret)]
    fn tokenize(&mut self, file: Rc<RefCell<File>>) -> TokenList {
        self.current_file = Some(file.clone());

        let mut loc = Location {
            idx: 0,
            end_idx: file.borrow().contents.len(),
            contents: file.borrow().contents.clone(),
        };

        let mut tokens: Vec<Token> = Vec::new();
        let mut at_bol = true;
        let mut has_space: bool = false;

        while loc.idx < loc.end_idx {
            // Skip line comments.
            if loc.as_bytes().starts_with("//".as_bytes()) {
                loc.idx += 2;
                while loc.len() > 0 && loc[0] != b'\n' {
                    loc.idx += 1;
                }
                has_space = true;
            }
            // Skip block comments
            else if loc.as_bytes().starts_with("/*".as_bytes()) {
                #[instrument(ret, level = "debug")]
                fn strstr(s: &[u8], m: &[u8]) -> Option<usize> {
                    // An empty needle matches at offset 0; windows()
                    // panics on a window size of 0, so handle it first.
                    if m.is_empty() {
                        return Some(0);
                    }
                    // windows() yields nothing when m is longer than s,
                    // so position() returns None without going out of bounds.
                    s.windows(m.len()).position(|w| w == m)
                }
                let q = strstr(&loc.as_bytes()[2..], "*/".as_bytes());
                match q {
                    Some(i) => {
                        loc.idx += 2 + i;
                        loc.idx += 2;
                        has_space = true
                    }
                    None => {
                        self.error(&loc, "unclosed block comment");
                        return TokenList {
                            tokens: Vec::new(),
                            idx: 0,
                        };
                    }
                }
            }
            // Skip newline.
            else if loc[0] == b'\n' {
                loc.idx += 1;
                at_bol = true;
                has_space = false;
            }
            // Skip whitespace characters.
            else if loc[0].is_ascii_whitespace() {
                loc.idx += 1;
                has_space = true;
            } else {
                let at_bol_copy = at_bol;
                let has_space_copy = has_space;
                let mut push_token = |tok| {
                    tokens.push(tok);
                    at_bol = false;
                    has_space = false;
                };

                // Numeric literal
                if loc[0].is_ascii_digit()
                    || (loc[0] == b'.' && loc.len() > 1 && loc[1].is_ascii_digit())
                {
                    let start: usize = loc.idx;
                    loc.idx += 1;
                    loop {
                        if loc.len() >= 2
                            && loc[0] != 0
                            && loc[1] != 0
                            && "eEpP".as_bytes().contains(&loc[0])
                            && "+-".as_bytes().contains(&loc[1])
                        {
                            loc.idx += 2;
                        } else if loc.len() > 0
                            && (loc[0].is_ascii_alphanumeric() || loc[0] == b'.')
                        {
                            loc.idx += 1;
                        } else {
                            break;
                        }
                    }
                    push_token(Token::new(
                        TokenKind::PPNum,
                        self.current_file.clone().unwrap(),
                        Location {
                            idx: start,
                            end_idx: loc.idx,
                            contents: loc.contents.clone(),
                        },
                        at_bol_copy,
                        has_space_copy,
                    ));
                }
                // String literal
                else if loc[0] == b'"' {
                    let tok = self.read_string_literal(&loc, &loc, at_bol_copy, has_space_copy);
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // UTF-8 string literal
                else if loc.as_bytes().starts_with("u8\"".as_bytes()) {
                    let mut quote = loc.clone();
                    quote.idx += 2;
                    let tok = self.read_string_literal(&loc, &quote, at_bol_copy, has_space_copy);
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // UTF-16 string literal
                else if loc.as_bytes().starts_with("u\"".as_bytes()) {
                    let mut quote = loc.clone();
                    quote.idx += 1;
                    let tok =
                        self.read_utf16_string_literal(&loc, &quote, at_bol_copy, has_space_copy);
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // Wide string literal
                else if loc.as_bytes().starts_with("L\"".as_bytes()) {
                    let mut quote = loc.clone();
                    quote.idx += 1;
                    let tok = self.read_utf32_string_literal(
                        &loc,
                        &quote,
                        &TY_INT,
                        at_bol_copy,
                        has_space_copy,
                    );
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // UTF-32 string literal
                else if loc.as_bytes().starts_with("U\"".as_bytes()) {
                    let mut quote = loc.clone();
                    quote.idx += 1;
                    let tok = self.read_utf32_string_literal(
                        &loc,
                        &quote,
                        &TY_UINT,
                        at_bol_copy,
                        has_space_copy,
                    );
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // Character literal
                else if loc[0] == b'\'' {
                    let tok =
                        self.read_char_literal(&loc, &loc, &TY_INT, at_bol_copy, has_space_copy);
                    // TODO: what is the meaning of cur->val = (char) cur->val;
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // UTF-16 character literal
                else if loc.as_bytes().starts_with("u'".as_bytes()) {
                    let mut quote = loc.clone();
                    quote.idx += 1;
                    let tok = self.read_char_literal(
                        &loc,
                        &quote,
                        &TY_USHORT,
                        at_bol_copy,
                        has_space_copy,
                    );
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // Wide character literal
                else if loc.as_bytes().starts_with("L'".as_bytes()) {
                    let mut quote = loc.clone();
                    quote.idx += 1;
                    let tok =
                        self.read_char_literal(&loc, &quote, &TY_INT, at_bol_copy, has_space_copy);
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // UTF-32 character literal
                else if loc.as_bytes().starts_with("U'".as_bytes()) {
                    let mut quote = loc.clone();
                    quote.idx += 1;
                    let tok: Token =
                        self.read_char_literal(&loc, &quote, &TY_UINT, at_bol_copy, has_space_copy);
                    loc.idx = tok.loc.end_idx;
                    push_token(tok);
                }
                // Identifier or keyword
                else {
                    let ident_len = read_ident(&loc.as_bytes());
                    if ident_len > 0 {
                        let tok = Token::new(
                            TokenKind::Ident,
                            self.current_file.clone().unwrap(),
                            Location {
                                idx: loc.idx,
                                end_idx: loc.idx + ident_len,
                                contents: loc.contents.clone(),
                            },
                            at_bol_copy,
                            has_space_copy,
                        );
                        loc.idx = tok.loc.end_idx;
                        push_token(tok);
                        continue;
                    }

                    let punct_len = read_punct(loc.as_bytes());
                    if punct_len > 0 {
                        let tok = Token::new(
                            TokenKind::Punct,
                            self.current_file.clone().unwrap(),
                            Location {
                                idx: loc.idx,
                                end_idx: loc.idx + punct_len,
                                contents: loc.contents.clone(),
                            },
                            at_bol_copy,
                            has_space_copy,
                        );
                        loc.idx = tok.loc.end_idx;
                        push_token(tok);
                        continue;
                    }

                    self.error(&loc, "invalid token");
                }
            }
        }

        self.add_line_number(&mut tokens);
        return TokenList { tokens, idx: 0 };
    }
}

#[instrument]
fn read_file(path: &str) -> Option<Vec<u8>> {
    let mut fp: Box<dyn Read>;
    if path == "-" {
        fp = Box::new(stdin());
    } else {
        match std::fs::File::open(path) {
            Ok(bf) => {
                if bf.metadata().unwrap().is_dir() {
                    return None;
                }
                fp = Box::new(bf);
            }
            Err(_) => {
                return None;
            }
        }
    }

    let mut buf = Vec::new();

    // Read the entire file.
    loop {
        let mut buf2: [u8; 4096] = [0; 4096];
        let n = fp.read(&mut buf2).unwrap();
        if n == 0 {
            break;
        }
        buf.extend_from_slice(&buf2[..n]);
    }

    // Make sure that the last line is properly terminated with '\n'.
    if buf.len() == 0 || *buf.last().unwrap() != b'\n' {
        buf.push(b'\n');
    }

    return Some(buf);
}

impl Tokenize {
    #[instrument(ret)]
    fn get_input_files(&self) -> Vec<Rc<RefCell<File>>> {
        return self.input_files.clone();
    }
}

impl File {
    #[instrument(ret, skip(contents))]
    fn new(name: String, file_no: i32, contents: Vec<u8>) -> Self {
        let display_name = name.clone();
        return File {
            name,
            file_no,
            contents: Rc::new(contents),
            display_name,
            line_delta: 0,
        };
    }
}

// Replaces \r or \r\n with \n.
#[instrument(skip_all)]
fn canonicalize_newline(p: &mut Vec<u8>) {
    let mut i = 0;
    let mut j = 0;
    while i < p.len() {
        // Use `get` to avoid an out-of-bounds read when '\r' is the last byte.
        if p[i] == b'\r' && p.get(i + 1) == Some(&b'\n') {
            p[j] = b'\n';
            i += 2;
            j += 1;
        } else if p[i] == b'\r' {
            p[j] = b'\n';
            i += 1;
            j += 1;
        } else {
            p[j] = p[i];
            i += 1;
            j += 1;
        }
    }
    p.truncate(j);
}

// Removes backslashes followed by a newline.
#[instrument(skip_all)]
fn remove_backslash_newline(p: &mut Vec<u8>) {
    let mut i = 0;
    let mut j = 0;

    // We want to keep the number of newline characters so that
    // the logical line number matches the physical one.
    // This counter maintains the number of newlines we have removed.
    let mut n = 0;

    while i < p.len() {
        if p[i] == b'\\' && p.get(i + 1) == Some(&b'\n') {
            i += 2;
            n += 1;
        } else if p[i] == b'\n' {
            p[j] = b'\n';
            i += 1;
            j += 1;
            while n > 0 {
                p[j] = b'\n';
                j += 1;
                n -= 1;
            }
        } else {
            p[j] = p[i];
            i += 1;
            j += 1;
        }
    }

    while n > 0 {
        p[j] = b'\n';
        j += 1;
        n -= 1;
    }

    p.truncate(j);
}
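// A self-contained sketch (illustrative only, independent of the function
// above) of the line-splicing rule it implements: each removed
// backslash-newline pair is paid back as an extra newline at the next line
// end, so the physical newline count is preserved for diagnostics.
// `splice_sketch` is a hypothetical helper, not used by the compiler.
#[allow(dead_code)]
fn splice_sketch(src: &str) -> String {
    let mut out = String::new();
    let mut pending = 0;
    let mut it = src.chars().peekable();
    while let Some(c) = it.next() {
        if c == '\\' && it.peek() == Some(&'\n') {
            it.next(); // Drop the backslash-newline pair.
            pending += 1;
        } else if c == '\n' {
            out.push('\n');
            for _ in 0..pending {
                out.push('\n'); // Pay back the removed newlines.
            }
            pending = 0;
        } else {
            out.push(c);
        }
    }
    out
}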

#[instrument(ret, level = "debug")]
fn read_universal_char(p: &[u8], len: usize) -> u32 {
    let mut c: u32 = 0;
    for i in 0..len {
        if !p[i].is_ascii_hexdigit() {
            return 0;
        }
        c = (c << 4) | (from_hex(p[i]) as u32);
    }
    return c;
}

// Replace \u or \U escape sequences with corresponding UTF-8 bytes
#[instrument(skip_all)]
fn convert_universal_chars(p: &mut Vec<u8>) {
    let mut i = 0;
    let mut j = 0;
    while i < p.len() {
        if p[i..].starts_with("\\u".as_bytes()) {
            let c = read_universal_char(&p[i + 2..], 4);
            if c > 0 {
                i += 6;
                j += encode_utf8(&mut p[j..], c);
            } else {
                p[j] = p[i];
                i += 1;
                j += 1;
            }
        } else if p[i..].starts_with("\\U".as_bytes()) {
            let c = read_universal_char(&p[i + 2..], 8);
            if c > 0 {
                i += 10;
                j += encode_utf8(&mut p[j..], c);
            } else {
                // Invalid escape: copy one byte, as in the "\u" case above.
                p[j] = p[i];
                i += 1;
                j += 1;
            }
        } else if p[i] == b'\\' {
            p[j] = p[i];
            i += 1;
            j += 1;
            p[j] = p[i];
            i += 1;
            j += 1;
        } else {
            p[j] = p[i];
            i += 1;
            j += 1;
        }
    }
    p.truncate(j);
}
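// A self-contained sketch (illustrative only) of what the "\uXXXX" branch
// above does conceptually: parse four hex digits into a code point and
// re-encode it as UTF-8. `decode_u4_sketch` is a hypothetical demo helper,
// not part of the compiler.
#[allow(dead_code)]
fn decode_u4_sketch(esc: &str) -> Option<String> {
    let hex = esc.strip_prefix("\\u")?;
    if hex.len() != 4 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let v = u32::from_str_radix(hex, 16).ok()?;
    // `char::from_u32` rejects surrogates and out-of-range values;
    // pushing the char into a String performs the UTF-8 encoding.
    char::from_u32(v).map(|c| c.to_string())
}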

impl Tokenize {
    #[instrument]
    pub fn tokenize_file(&mut self, path: &str) -> Option<TokenList> {
        let op = read_file(path);
        if op.is_none() {
            return None;
        }
        let mut contents = op.unwrap();

        // UTF-8 texts may start with a 3-byte "BOM" marker sequence.
        // If exists, just skip them because they are useless bytes.
        // (It is actually not recommended to add BOM markers to UTF-8
        // texts, but it's not uncommon particularly on Windows.)

        if contents.starts_with(&[0xef, 0xbb, 0xbf]) {
            contents.drain(..3);
        }

        canonicalize_newline(&mut contents);
        remove_backslash_newline(&mut contents);
        convert_universal_chars(&mut contents);

        // Save the filename for assembler .file directive.
        let file = Rc::new(RefCell::new(File::new(
            path.into(),
            self.input_files.len() as i32,
            contents,
        )));
        self.input_files.push(file.clone());

        return Some(self.tokenize(file));
    }
}

#[cfg(test)]
mod test_tokenize {
    use std::{cell::RefCell, fmt::Write, rc::Rc};

    use crate::chibicc::chibicc::{read_punct, setup_tracing};

    use super::{File, Tokenize};

    #[test]
    fn test_read_punc() {
        let s = r#"<= 0x7F) {
    buf[0] ="#;
        let len = read_punct(s.as_bytes());
        assert_eq!(len, 2);
    }

    #[test]
    fn test_tokenize() {
        setup_tracing();
        // Contents is a string that hits most branches
        // of tokenize.
        let contents = r#"
// Line Comment
/* 
    Block comment
*/
// Numbers
12345;
123.45;
10e+7;
0xFF;
0b11110000;

// String literals
strcmp("αβγ", "\u03B1\u03B2\u03B3");
strcmp("日本語", "\u65E5\u672C\u8A9E");
"u8\"a\"";
strcmp(u8"abc", "abc");
({ char16_t x[] = u"αβ"; x[1]; }))
memcmp(L"🍣", "c\363\001\0\0\0\0\0", 8);
sizeof(L"\xffzzz");
memcmp(U"abc", "a\0\0\0b\0\0\0c\0\0\0\0\0\0\0", 16);
memcmp(U"🍣", "c\363\001\0\0\0\0\0", 8);

// Char literals
L'\0';
L'a';
L'\xffffffff'>>31;
L'あ';
u'\0';
u'\xffff'>>15;
U'\xffffffff'>>31;
U'🍣';

// Identifier
var variablename123 = 1234;

// Punctuation
if (variablename123 == true) {
}
 "#;
        let expected_out = r#"Token(Punct, *)
Token(Punct, /)
Token(PPNum, 12345)
Token(Punct, ;)
Token(PPNum, 123.45)
Token(Punct, ;)
Token(PPNum, 10e+7)
Token(Punct, ;)
Token(PPNum, 0xFF)
Token(Punct, ;)
Token(PPNum, 0b11110000)
Token(Punct, ;)
Token(Ident, strcmp)
Token(Punct, ()
Token(Str, "αβγ")
Token(Punct, ,)
Token(Str, "\u03B1\u03B2\u03B3")
Token(Punct, ))
Token(Punct, ;)
Token(Ident, strcmp)
Token(Punct, ()
Token(Str, "日本語")
Token(Punct, ,)
Token(Str, "\u65E5\u672C\u8A9E")
Token(Punct, ))
Token(Punct, ;)
Token(Str, "u8\"a\"")
Token(Punct, ;)
Token(Ident, strcmp)
Token(Punct, ()
Token(Str, u8"abc")
Token(Punct, ,)
Token(Str, "abc")
Token(Punct, ))
Token(Punct, ;)
Token(Punct, ()
Token(Punct, {)
Token(Ident, char16_t)
Token(Ident, x)
Token(Punct, [)
Token(Punct, ])
Token(Punct, =)
Token(Str, u"αβ")
Token(Punct, ;)
Token(Ident, x)
Token(Punct, [)
Token(PPNum, 1)
Token(Punct, ])
Token(Punct, ;)
Token(Punct, })
Token(Punct, ))
Token(Punct, ))
Token(Ident, memcmp)
Token(Punct, ()
Token(Str, L"🍣")
Token(Punct, ,)
Token(Str, "c\363\001\0\0\0\0\0")
Token(Punct, ,)
Token(PPNum, 8)
Token(Punct, ))
Token(Punct, ;)
Token(Ident, sizeof)
Token(Punct, ()
Token(Str, L"\xffzzz")
Token(Punct, ))
Token(Punct, ;)
Token(Ident, memcmp)
Token(Punct, ()
Token(Str, U"abc")
Token(Punct, ,)
Token(Str, "a\0\0\0b\0\0\0c\0\0\0\0\0\0\0")
Token(Punct, ,)
Token(PPNum, 16)
Token(Punct, ))
Token(Punct, ;)
Token(Ident, memcmp)
Token(Punct, ()
Token(Str, U"🍣")
Token(Punct, ,)
Token(Str, "c\363\001\0\0\0\0\0")
Token(Punct, ,)
Token(PPNum, 8)
Token(Punct, ))
Token(Punct, ;)
Token(Num, L'\0')
Token(Punct, ;)
Token(Num, L'a')
Token(Punct, ;)
Token(Num, L'\xffffffff')
Token(Punct, >>)
Token(PPNum, 31)
Token(Punct, ;)
Token(Num, L'あ')
Token(Punct, ;)
Token(Num, u'\0')
Token(Punct, ;)
Token(Num, u'\xffff')
Token(Punct, >>)
Token(PPNum, 15)
Token(Punct, ;)
Token(Num, U'\xffffffff')
Token(Punct, >>)
Token(PPNum, 31)
Token(Punct, ;)
Token(Num, U'🍣')
Token(Punct, ;)
Token(Ident, var)
Token(Ident, variablename123)
Token(Punct, =)
Token(PPNum, 1234)
Token(Punct, ;)
Token(Ident, if)
Token(Punct, ()
Token(Ident, variablename123)
Token(Punct, ==)
Token(Ident, true)
Token(Punct, ))
Token(Punct, {)
Token(Punct, })
Token(EOF, )
"#;

        let f = Rc::new(RefCell::new(File {
            name: "test".into(),
            file_no: 0,
            contents: Rc::new(Vec::from(contents.as_bytes())),
            display_name: "test".into(),
            line_delta: 0,
        }));

        let mut t = Tokenize {
            current_file: Some(f.clone()),
            input_files: Vec::from([f.clone()]),
        };

        let v = t.tokenize(f);
        let mut actual_out = String::new();
        for t in v.tokens {
            actual_out.write_str(t.to_string().as_str()).unwrap();
            actual_out.write_char('\n').unwrap();
        }

        println!("{}", actual_out);

        {
            let expected_lines: Vec<&str> = expected_out.split("\n").collect();
            let actual_lines: Vec<&str> = actual_out.split("\n").collect();

            let mut fail = false;
            for (i, line) in expected_lines.iter().enumerate() {
                if i >= actual_lines.len() {
                    break;
                }
                if *line != actual_lines[i] {
                    fail = true;
                    println!(
                        "line not equal: i={}\n\texpected={}\n\tactual={}",
                        i, line, actual_lines[i]
                    );
                }
            }
            if expected_lines.len() != actual_lines.len() {
                fail = true;
                println!(
                    "len(expected_lines) != len(actual_lines): {} {}",
                    expected_lines.len(),
                    actual_lines.len()
                );
            }
            assert!(!fail);
        }
    }
}

//
// preprocess.c
//

// The preprocessor takes a list of tokens as an input and returns a
// new list of tokens as an output.
//
// The preprocessing language is designed in such a way that macro
// expansion is guaranteed to terminate even in the presence of a
// recursive macro. Informally speaking, a macro is applied only once
// for each token. That is, if a macro token T appears in the result of
// a direct or indirect macro expansion of T, T won't be expanded any
// further. For example, if T is defined as U, and U is defined as T,
// then token T is expanded to U and then to T, and the macro expansion
// stops at that point.
//
// To achieve the above behavior, we attach to each token the set of
// macro names from which the token was expanded. The set is called a
// "hideset". A hideset is initially empty, and every time we expand a
// macro, the macro name is added to the resulting tokens' hidesets.
//
// The above macro expansion algorithm is explained in this document
// written by Dave Prosser, which is used as a basis for the
// standard's wording.
// https://github.com/rui314/chibicc/wiki/cpp.algo.pdf
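
// A self-contained sketch (illustrative only) of the hideset rule described
// above: expansion of a token stops as soon as the macro's own name is
// already in the token's hideset, so T -> U -> T terminates after one round
// trip. `hideset_sketch` is a hypothetical demo, not used by the compiler.
#[allow(dead_code)]
fn hideset_sketch() -> Vec<String> {
    use std::collections::{HashMap, HashSet};
    // "T" expands to "U" and "U" expands back to "T".
    let defs: HashMap<&str, &str> = HashMap::from([("T", "U"), ("U", "T")]);
    let mut out = Vec::new();
    // Each work item is a token name plus the hideset it carries.
    let mut stack = vec![("T".to_string(), HashSet::<String>::new())];
    while let Some((name, hideset)) = stack.pop() {
        match defs.get(name.as_str()) {
            Some(body) if !hideset.contains(&name) => {
                // Expanding: the macro's own name joins the hideset.
                let mut hs = hideset;
                hs.insert(name);
                stack.push((body.to_string(), hs));
            }
            // Not a macro, or already in the hideset: emit as-is.
            _ => out.push(name),
        }
    }
    out
}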

#[derive(Debug)]
struct MacroParam {
    name: String,
}

#[derive(Clone, Debug)]
struct MacroArg {
    name: String,
    is_va_args: bool,
    tok: Rc<Vec<Token>>,
}

type MacroHandlerFn = fn(&Token, &mut Preprocess) -> Token;

#[derive(Debug)]
struct Macro {
    name: String,
    is_objlike: bool, // Object-like or function-like
    params: Vec<MacroParam>,
    va_args_name: Option<String>,
    body: Option<Rc<TokenList>>,
    handler: Option<MacroHandlerFn>,
}

// `#if` can be nested, so we use a stack to manage nested `#if`s.
#[derive(Debug, PartialEq)]
enum CondInclCtx {
    InThen,
    InElif,
    InElse,
}

#[derive(Debug)]
struct CondIncl {
    ctx: CondInclCtx,
    tok: Rc<Token>,
    included: bool,
}

pub struct Preprocess {
    macros: HashMap<String, Rc<Macro>>,
    cond_incl: Vec<Rc<RefCell<CondIncl>>>,
    pragma_once: HashSet<String>,
    include_paths: Vec<String>,
    include_next_idx: usize,
    include_cache: HashMap<String, String>,
    include_guards: HashMap<String, String>,
    counter_macro_idx: i32,
    base_file: String,
}

impl std::fmt::Debug for Preprocess {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("Preprocess")
            .field("macros.key()", &self.macros.keys())
            .field(
                "cond_incl",
                &self
                    .cond_incl
                    .iter()
                    .map(|c| format!("[{}, {:?}]", c.borrow().tok, c.borrow().ctx))
                    .collect::<Vec<String>>(),
            )
            .field("pragma_once", &self.pragma_once)
            .field("include_paths", &self.include_paths)
            .field("include_next_idx", &self.include_next_idx)
            .field("include_cache", &self.include_cache)
            .field("include_guards", &self.include_guards)
            .field("counter_macro_idx", &self.counter_macro_idx)
            .field("base_file", &self.base_file)
            .finish()
    }
}

impl Preprocess {
    pub fn new(include_paths: Vec<String>) -> Preprocess {
        return Preprocess {
            macros: HashMap::new(),
            cond_incl: Vec::new(),
            pragma_once: HashSet::new(),
            include_paths,
            include_next_idx: 0,
            include_cache: HashMap::new(),
            include_guards: HashMap::new(),
            counter_macro_idx: 0,
            base_file: String::new(),
        };
    }
}

impl Token {
    #[instrument(ret, level=Level::DEBUG)]
    fn equal(&self, op: &str) -> bool {
        self.loc.as_bytes() == op.as_bytes()
    }

    #[instrument(ret, level=Level::DEBUG)]
    fn is_hash(&self) -> bool {
        return self.at_bol && self.equal("#");
    }
}

#[derive(Clone)] // TODO: reduce clone() calls
pub struct TokenList {
    tokens: Vec<Token>,
    idx: usize,
}

impl std::fmt::Debug for TokenList {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("TokenList")
            .field("tokens[idx]", &self.tokens.get(self.idx))
            .field("idx", &self.idx)
            .field("len()", &self.len())
            .finish()
    }
}

impl Index<Range<usize>> for TokenList {
    type Output = [Token];

    fn index(&self, index: Range<usize>) -> &Self::Output {
        let r: Range<usize> = Range {
            start: index.start + self.idx,
            end: index.end + self.idx,
        };
        return &self.tokens[r];
    }
}

impl Index<RangeFull> for TokenList {
    type Output = [Token];

    fn index(&self, _: RangeFull) -> &Self::Output {
        let len = self.tokens.len() - self.idx;
        return &self[0..len];
    }
}

impl Index<usize> for TokenList {
    type Output = Token;

    fn index(&self, index: usize) -> &Self::Output {
        return &self.tokens[self.idx + index];
    }
}

impl TokenList {
    fn len(&self) -> usize {
        return self.tokens.len() - self.idx;
    }
}

impl std::ops::IndexMut<usize> for TokenList {
    fn index_mut(&mut self, index: usize) -> &mut Self::Output {
        return &mut self.tokens[self.idx + index];
    }
}

impl TokenList {
    #[instrument(ret)]
    fn skip(&mut self, op: &str) {
        let tok = &self[0];
        if !tok.equal(op) {
            tok.error(&format!("expected '{}'", op));
        }
        self.idx += 1;
    }
    #[instrument(ret)]
    fn consume(&mut self, op: &str) -> bool {
        let tok = &self[0];
        if tok.equal(op) {
            self.idx += 1;
            return true;
        }
        return false;
    }

    // Some preprocessor directives such as #include allow extraneous
    // tokens before newline. This function skips such tokens.
    #[instrument(ret)]
    fn skip_line(&mut self) {
        if self.len() == 0 {
            return;
        }
        let tok = &self[0];
        if tok.at_bol {
            return;
        }
        tok.warn("extra token"); // TODO: issues with nested macros?
        while !self[0].at_bol {
            self.idx += 1;
        }
    }
}

impl Hideset {
    #[instrument(ret, level = Level::DEBUG)]
    fn new(name: String) -> Self {
        return Hideset { name };
    }
}

#[instrument(ret, level=Level::DEBUG)]
fn hideset_union(hs1: &Vec<Hideset>, hs2: &Vec<Hideset>) -> Vec<Hideset> {
    let mut r = hs1.clone();
    r.extend_from_slice(hs2);
    return r;
}

#[instrument(ret, level=Level::DEBUG)]
fn hideset_contains(hs: &Vec<Hideset>, s: &str) -> bool {
    for h in hs.iter() {
        if h.name == s {
            return true;
        }
    }
    return false;
}

#[instrument(ret)]
fn hideset_intersection(hs1: &Vec<Hideset>, hs2: &Vec<Hideset>) -> Vec<Hideset> {
    let mut r = Vec::new();
    for x in hs1.iter() {
        if hideset_contains(hs2, &x.name) {
            r.push(Hideset::new(x.name.clone()))
        }
    }
    return r;
}

impl TokenList {
    #[instrument(ret)]
    fn add_hideset(&mut self, hs: &Vec<Hideset>) {
        for i in self.idx..self.tokens.len() {
            self.tokens[i].hideset = hideset_union(&self.tokens[i].hideset, hs);
        }
    }

    #[instrument(ret)]
    fn append(&mut self, tok: &TokenList) {
        self.tokens.extend_from_slice(&tok.tokens[tok.idx..]);
    }

    #[instrument(ret)]
    fn skip_cond_incl2(&mut self) {
        while self.len() > 0 {
            if self[0].is_hash()
                && (self[1].equal("if") || self[1].equal("ifdef") || self[1].equal("ifndef"))
            {
                self.idx += 2;
                self.skip_cond_incl2();
                continue;
            }
            if self[0].is_hash() && self[1].equal("endif") {
                self.idx += 2;
                return;
            }
            self.idx += 1;
        }
    }

    // Skip until next `#else`, `#elif`, or `#endif`.
    // Nested `#if` and `#endif` are skipped.
    #[instrument(ret)]
    fn skip_cond_incl(&mut self) {
        while self.len() > 0 {
            if self[0].is_hash()
                && (self[1].equal("if") || self[1].equal("ifdef") || self[1].equal("ifndef"))
            {
                self.idx += 2;
                self.skip_cond_incl2();
                continue;
            }

            if self[0].is_hash()
                && (self[1].equal("elif") || self[1].equal("else") || self[1].equal("endif"))
            {
                break;
            }

            self.idx += 1;
        }
    }
}

#[instrument(ret)]
fn quote_string(s: &str) -> String {
    let mut bufsize = 3;
    for b in s.bytes() {
        if b == b'\\' || b == b'"' {
            bufsize += 1;
        }
        bufsize += 1;
    }

    let mut p = String::with_capacity(bufsize);
    p.push('"');
    // Iterate over chars, not bytes: pushing the raw bytes of a multi-byte
    // UTF-8 character as individual `char`s would re-encode them wrongly.
    for c in s.chars() {
        if c == '\\' || c == '"' {
            p.push('\\');
        }
        p.push(c);
    }
    p.push('"');
    return p;
}

impl Token {
    #[instrument(ret)]
    fn new_str_token(s: &str, tmpl: &Token) -> Token {
        let buf = quote_string(s);
        let mut t = Tokenize::new();
        let f = File::new(
            tmpl.file.borrow().name.clone(),
            tmpl.file.borrow().file_no,
            buf.into_bytes(),
        );
        let tokens: TokenList = t.tokenize(Rc::new(RefCell::new(f)));
        assert!(tokens.tokens.len() == 1);
        return tokens[0].clone();
    }
}

impl TokenList {
    #[instrument(ret)]
    fn copy_line(&mut self) -> TokenList {
        let mut r = Vec::new();
        while self.len() > 0 && !self[0].at_bol {
            r.push(self[0].clone());
            self.idx += 1;
        }
        return TokenList { tokens: r, idx: 0 };
    }
}

impl Token {
    #[instrument(ret)]
    fn new_num_token(val: i32, tmpl: &Token) -> Token {
        let buf = format!("{}\n", val);
        let mut t = Tokenize::new();
        let f = File::new(
            tmpl.file.borrow().name.clone(),
            tmpl.file.borrow().file_no,
            buf.into_bytes(),
        );
        let tokens: TokenList = t.tokenize(Rc::new(RefCell::new(f)));
        assert!(tokens.tokens.len() == 1);
        return tokens[0].clone();
    }
}

impl TokenList {
    #[instrument(ret)]
    fn read_const_expr(&mut self, pp: &Preprocess) -> TokenList {
        let mut tok = self.copy_line();
        let mut ret = Vec::new();
        while tok.len() > 0 {
            // "defined(foo)" or "defined foo" becomes "1" if macro "foo"
            // is defined. Otherwise "0".
            if tok[0].equal("defined") {
                let start = tok[0].clone();
                tok.idx += 1;
                let has_paren = tok.consume("(");
                if tok[0].kind != TokenKind::Ident {
                    start.error("macro name must be an identifier");
                }
                let m = pp.find_macro(&tok[0]);
                tok.idx += 1;
                if has_paren {
                    tok.skip(")");
                }

                ret.push(Token::new_num_token(
                    if m.is_some() { 1 } else { 0 },
                    &start,
                ));
                continue;
            }

            ret.push(tok[0].clone());
            tok.idx += 1;
        }

        return TokenList {
            tokens: ret,
            idx: 0,
        };
    }

    #[instrument(ret)]
    fn eval_const_expr(&mut self, pp: &mut Preprocess, tokenize: &mut Tokenize) -> i64 {
        let start = self[0].clone();
        self.idx += 1;
        let mut expr = self.read_const_expr(pp);
        expr = preprocess2(expr, pp, tokenize);

        if expr.len() == 0 {
            start.error("no expression");
        }

        // [https://www.sigbus.info/n1570#6.10.1p4] The standard requires
        // we replace remaining non-macro identifiers with "0" before
        // evaluating a constant expression. For example, `#if foo` is
        // equivalent to `#if 0` if foo is not defined.
        let expr_orig = expr.join(None);
        {
            let mut i: usize = 0;
            while i < expr.len() {
                if expr[i].kind == TokenKind::Ident {
                    expr[i] = Token::new_num_token(0, &expr[i]);
                }
                i += 1;
            }
        }

        // Convert pp-numbers to regular numbers
        Tokenize::convert_pp_tokens(&mut expr.tokens);

        eprintln!(
            "WARNING: if evaluation not implemented yet, returning 0: {} ||| {}",
            expr_orig,
            expr.join(None)
        );
        return 0;

        // TODO: implement after implementing parse
        // expr.const_expr
        /*
         Token *rest2;
           long val = const_expr(&rest2, expr);
           if (rest2->kind != TK_EOF)
               error_tok(rest2, "extra token");
         return val;
        */
        todo!();
    }
}

impl Preprocess {
    #[instrument(ret)]
    fn push_cond_incl(&mut self, tok: Rc<Token>, included: bool) -> Rc<RefCell<CondIncl>> {
        let ci = CondIncl {
            ctx: CondInclCtx::InThen,
            tok: tok.clone(),
            included,
        };
        let circ = Rc::new(RefCell::new(ci));
        self.cond_incl.push(circ.clone());
        return circ;
    }

    #[instrument(ret, level=Level::DEBUG)]
    fn find_macro(&self, tok: &Token) -> Option<Rc<Macro>> {
        if tok.kind != TokenKind::Ident {
            return None;
        }
        return self.macros.get(tok.loc.as_str()).map(|x| x.clone());
    }

    #[instrument(ret, level=Level::DEBUG)]
    fn add_macro(&mut self, m: Macro) -> Rc<Macro> {
        let name = m.name.clone();
        let mrc = Rc::new(m);
        self.macros.insert(name, mrc.clone());
        return mrc;
    }
}

impl TokenList {
    #[instrument(ret)]
    fn read_macro_params(&mut self) -> (Vec<MacroParam>, Option<String>) {
        let mut va_args_name = None;
        let mut params = Vec::new();
        while !self[0].equal(")") {
            if params.len() != 0 {
                self.skip(",");
            }

            if self[0].equal("...") {
                va_args_name = Some("__VA_ARGS__".into());
                self.idx += 1;
                self.skip(")");
                return (params, va_args_name);
            }

            if self[0].kind != TokenKind::Ident {
                self[0].error("expected an identifier");
            }

            if self[1].equal("...") {
                va_args_name = Some(String::from(self[0].loc.as_str()));
                self.idx += 2;
                self.skip(")");
                return (params, va_args_name);
            }

            let m = MacroParam {
                name: self[0].loc.as_str().into(),
            };
            params.push(m);
            self.idx += 1;
        }

        self.idx += 1;
        return (params, va_args_name);
    }

    #[instrument(ret)]
    fn read_macro_definition(&mut self, pp: &mut Preprocess) {
        if self[0].kind != TokenKind::Ident {
            self[0].error("macro name must be an identifier");
        }
        let name = String::from(self[0].loc.as_str());
        self.idx += 1;

        if !self[0].has_space && self[0].equal("(") {
            // Function-like macro
            self.idx += 1;
            let (params, va_args_name) = self.read_macro_params();
            let tokens = self.copy_line();
            let m = Macro {
                name,
                is_objlike: false,
                params,
                va_args_name,
                body: Some(Rc::new(tokens)),
                handler: None,
            };
            pp.add_macro(m);
        } else {
            let tokens = self.copy_line();
            let m = Macro {
                name,
                is_objlike: true,
                params: Vec::new(),
                va_args_name: None,
                body: Some(Rc::new(tokens)),
                handler: None,
            };
            pp.add_macro(m);
        }
    }

    #[instrument(ret)]
    fn read_macro_arg_one(&mut self, read_rest: bool) -> MacroArg {
        let mut tokens = Vec::new();
        let mut level = 0;

        loop {
            // Check for exhaustion first: indexing `self[0]` on an empty
            // list would panic before we could report the error.
            if self.len() == 0 {
                self.tokens.last().unwrap().error("premature end of input");
            }
            if level == 0 && self[0].equal(")") {
                break;
            }
            if level == 0 && !read_rest && self[0].equal(",") {
                break;
            }

            if self[0].equal("(") {
                level += 1;
            } else if self[0].equal(")") {
                level -= 1;
            }

            tokens.push(self[0].clone());
            self.idx += 1;
        }

        let arg = MacroArg {
            name: String::new(),
            is_va_args: false,
            tok: Rc::new(tokens),
        };

        return arg;
    }

    #[instrument(ret)]
    fn read_macro_args(
        &mut self,
        params: &Vec<MacroParam>,
        va_args_name: Option<String>,
    ) -> Vec<MacroArg> {
        self.idx += 2;
        let mut args = Vec::new();

        for p in params {
            if args.len() != 0 {
                self.skip(",");
            }
            let mut a = self.read_macro_arg_one(false);
            a.name = p.name.clone();
            args.push(a);
        }

        if let Some(va_args_name) = va_args_name {
            let mut arg = if self[0].equal(")") {
                MacroArg {
                    name: String::new(),
                    is_va_args: true,
                    tok: Rc::new(vec![]),
                }
            } else {
                if args.len() > 0 {
                    self.skip(",");
                }
                self.read_macro_arg_one(true)
            };
            arg.name = va_args_name;
            arg.is_va_args = true;
            args.push(arg);
        }

        self.skip(")");
        self.idx -= 1;

        return args;
    }
}

#[instrument(ret, level=Level::DEBUG)]
fn find_arg(args: &Vec<MacroArg>, tok: &Token) -> Option<MacroArg> {
    for a in args {
        if a.name == tok.loc.as_str() {
            return Some(a.clone());
        }
    }
    return None;
}

impl TokenList {
    // Concatenates all tokens in `self` and returns a new string
    #[instrument(ret)]
    fn join(&self, end: Option<&Token>) -> String {
        let mut len = 0;
        for t in &self[..] {
            if end.is_some() && t.loc.idx == end.unwrap().loc.idx {
                break;
            }
            if len == 0 || t.has_space {
                len += 1;
            }
            len += t.loc.len();
        }

        let mut buf = String::with_capacity(len);

        // Copy token texts.
        for t in &self[..] {
            if end.is_some() && t.loc.idx == end.unwrap().loc.idx {
                break;
            }

            if buf.len() > 0 && t.has_space {
                buf.push(' ');
            }
            buf.push_str(t.loc.as_str());
        }
        return buf;
    }

    // Concatenate all tokens in `self` and return a new string token.
    // This function is used for the stringizing operator (#).
    #[instrument(ret)]
    fn stringize(&self, hash: &Token) -> Token {
        // Create a new string token. We need to set some value for
        // its source location for the error-reporting functions, so
        // we use a macro name token as a template.
        let s = self.join(None);
        return Token::new_str_token(&s, hash);
    }
}
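`join` and `stringize` together implement the `#` operator: token spellings are concatenated, with a single space wherever the original token had leading whitespace, and the result is wrapped in a string literal. A minimal sketch over `(text, has_space)` pairs (the `stringize_tokens` helper is hypothetical):

```rust
// Join token texts, inserting one space where the original had
// whitespace, then wrap the result in quotes like the # operator does.
fn stringize_tokens(tokens: &[(&str, bool)]) -> String {
    let mut buf = String::new();
    for &(text, has_space) in tokens {
        if !buf.is_empty() && has_space {
            buf.push(' ');
        }
        buf.push_str(text);
    }
    format!("\"{}\"", buf)
}
```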

impl Token {
    #[instrument(ret)]
    fn paste(&self, other: &Token) -> Token {
        let buf = format!("{}{}", self.loc.as_str(), other.loc.as_str());

        let mut t = Tokenize::new();
        let f = File::new(
            self.file.borrow().name.clone(),
            self.file.borrow().file_no,
            buf.clone().into_bytes(),
        );
        let tokens = t.tokenize(Rc::new(RefCell::new(f)));
        if tokens.tokens.len() != 1 {
            self.error(format!("pasting forms '{}', an invalid token", buf).as_str());
        }
        return tokens[0].clone();
    }
}
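`paste` implements `##` by concatenating two spellings and re-tokenizing; if the result is not exactly one token, pasting is an error. A toy version that only accepts identifiers (the `paste_idents` helper is illustrative; the real code re-runs the full tokenizer):

```rust
// Concatenate two spellings and verify the result is still a single
// (identifier) token, mirroring the "pasting forms an invalid token" check.
fn paste_idents(lhs: &str, rhs: &str) -> Result<String, String> {
    let buf = format!("{}{}", lhs, rhs);
    let mut chars = buf.chars();
    let valid = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_');
    if valid {
        Ok(buf)
    } else {
        Err(format!("pasting forms '{}', an invalid token", buf))
    }
}
```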

#[instrument(ret)]
fn has_varargs(args: &Vec<MacroArg>) -> bool {
    for ap in args {
        if ap.name == "__VA_ARGS__" {
            return ap.tok.len() > 0;
        }
    }
    return false;
}

// Replace func-like macro params with given arguments.
impl TokenList {
    #[instrument(ret)]
    fn subst(
        mut self,
        args: &Vec<MacroArg>,
        pp: &mut Preprocess,
        tokenize: &mut Tokenize,
    ) -> TokenList {
        let mut v = Vec::new();

        while self.len() > 0 {
            let tok = &self[0];
            if tok.equal("#") {
                let arg = find_arg(args, &self[1]);
                if arg.is_none() {
                    self[1].error("'#' is not followed by a macro parameter");
                }
                let argtl: TokenList = TokenList {
                    tokens: arg.unwrap().tok.to_vec(),
                    idx: 0,
                };
                v.push(argtl.stringize(&self[0]));
                self.idx += 2;
                continue;
            }

            // [GNU] If __VA_ARGS__ is empty, `,##__VA_ARGS__` is expanded
            // to an empty token list. Otherwise, it's expanded to `,` and
            // __VA_ARGS__.
            if tok.equal(",") && self[1].equal("##") {
                if let Some(arg) = find_arg(args, &self[2]) {
                    if arg.tok.len() == 0 {
                        self.idx += 3;
                    } else {
                        v.push(tok.clone());
                        self.idx += 2;
                    }
                    continue;
                }
            }
            if tok.equal("##") {
                if v.len() == 0 {
                    tok.error("'##' cannot appear at the start of macro expansion");
                }
                if self.len() == 1 {
                    tok.error("'##' cannot appear at the end of macro expansion");
                }
                let arg = find_arg(args, &self[1]);
                if let Some(a) = arg {
                    if a.tok.len() > 0 {
                        let last = v.last_mut().unwrap();
                        *last = last.paste(&a.tok[0]);
                        for t in &a.tok[1..] {
                            v.push(t.clone());
                        }
                    }
                    self.idx += 2;
                    continue;
                }

                let last = v.last_mut().unwrap();
                *last = last.paste(&self[1]);
                self.idx += 2;
                continue;
            }

            let arg = find_arg(args, &tok);
            if arg.is_some() && self.len() > 1 && self[1].equal("##") {
                let rhs = &self[2];
                let arg = arg.unwrap();
                if arg.tok.len() == 0 {
                    let arg2 = find_arg(args, rhs);
                    if let Some(arg2) = arg2 {
                        for t in arg2.tok.iter() {
                            v.push(t.clone());
                        }
                    } else {
                        v.push(rhs.clone());
                    }
                    self.idx += 3;
                    continue;
                }

                for t in arg.tok.iter() {
                    v.push(t.clone());
                }
                self.idx += 1;
                continue;
            }

            // If __VA_ARGS__ is empty, __VA_OPT__(x) is expanded to the
            // empty token list. Otherwise, __VA_OPT__(x) is expanded to x.
            if tok.equal("__VA_OPT__") && self[1].equal("(") {
                self.idx += 2;
                let arg = self.read_macro_arg_one(true);
                if has_varargs(args) {
                    for t in arg.tok.iter() {
                        v.push(t.clone());
                    }
                }
            self.skip(")");
                continue;
            }

            // Handle a macro token. Macro arguments are completely macro-expanded
            // before they are substituted into a macro body.
            if let Some(a) = arg {
                let mut t: TokenList = preprocess2(
                    TokenList {
                        tokens: a.tok.to_vec(),
                        idx: 0,
                    },
                    pp,
                    tokenize,
                );
                if t.len() > 0 {
                    t[0].at_bol = tok.at_bol;
                    t[0].has_space = tok.has_space;
                    for tok in t.tokens {
                        v.push(tok);
                    }
                }
                self.idx += 1;
                continue;
            }
            // Handle a non-macro token
            v.push(tok.clone());
            self.idx += 1;
            continue;
        }

        return TokenList { tokens: v, idx: 0 };
    }

    // If the current token is a macro, expand it and return (tokens, true).
    // Otherwise, do nothing and return (tokens, false).
    #[instrument(ret)]
    fn expand_macro(mut self, pp: &mut Preprocess, tokenize: &mut Tokenize) -> (TokenList, bool) {
        if hideset_contains(&self[0].hideset, self[0].loc.as_str()) {
            return (self, false);
        }

        let m = pp.find_macro(&self[0]);
        if m.is_none() {
            return (self, false);
        }
        let m = m.unwrap();

        // Built-in dynamic macro application such as __LINE__
        if let Some(h) = m.handler {
            let t: Token = h(&self[0], pp);
            let mut tokens = TokenList {
                tokens: vec![t],
                idx: 0,
            };
            self.idx += 1;
            tokens.append(&self);
            return (tokens, true);
        }

        // Object-like macro application
        if m.is_objlike {
            let hs = hideset_union(&self[0].hideset, &vec![Hideset::new(m.name.clone())]);
            let start_tok: Token = self[0].clone();
            let mut body = m.body.as_ref().unwrap().as_ref().clone();
            body.add_hideset(&hs); // TODO: should hideset be added to the macro body?
            for t in body.tokens.iter_mut() {
                t.origin = Some(Rc::new(start_tok.clone()))
            }
            self.idx += 1;
            body.append(&self);
            body[0].at_bol = start_tok.at_bol;
            body[0].has_space = start_tok.has_space;
            return (body, true);
        }

        // If a funclike macro token is not followed by an argument list,
        // treat it as a normal identifier.
        if self.len() == 1 || !self[1].equal("(") {
            return (self, false);
        }

        // Function-like macro application
        let macro_token = Rc::new(self[0].clone());
        let args = self.read_macro_args(&m.params, m.va_args_name.clone());
        let r_paren = self[0].clone();

        // Tokens that make up a func-like macro invocation may have different
        // hidesets, and if that's the case, it's not clear what the hideset
        // for the new tokens should be. We take the intersection of the
        // macro token's hideset and the closing parenthesis's hideset and
        // use it as the new hideset, as explained in Dave Prosser's algorithm.
        let mut hs = hideset_intersection(&macro_token.hideset, &r_paren.hideset);
        hs = hideset_union(&hs, &vec![Hideset::new(m.name.clone())]);
        let mut body = m.body.clone().unwrap().as_ref().clone();
        body = body.subst(&args, pp, tokenize);
        body.add_hideset(&hs);
        for t in body.tokens.iter_mut() {
            t.origin = Some(macro_token.clone());
        }
        body.append(&self);
        body[0].at_bol = macro_token.at_bol;
        body[0].has_space = macro_token.has_space;
        return (body, true);
    }
}
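The hideset bookkeeping in `expand_macro` is what makes expansion terminate: every token remembers which macro names produced it, and a macro already in a token's hideset is not expanded again, so `#define foo foo` stops after one step. A minimal sketch for object-like macros only (the `expand_obj` helper is illustrative and ignores function-like macros and hideset intersection):

```rust
use std::collections::{HashMap, HashSet};

// Expand object-like macros with per-token hidesets. A (token, hideset)
// work list stands in for the real token stream.
fn expand_obj(macros: &HashMap<&str, Vec<&str>>, name: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut stack: Vec<(String, HashSet<String>)> =
        vec![(name.to_string(), HashSet::new())];
    while let Some((tok, hs)) = stack.pop() {
        match macros.get(tok.as_str()) {
            Some(body) if !hs.contains(&tok) => {
                let mut hs2 = hs.clone();
                hs2.insert(tok.clone()); // hide this macro in its own expansion
                for t in body.iter().rev() {
                    stack.push((t.to_string(), hs2.clone()));
                }
            }
            _ => out.push(tok),
        }
    }
    out
}
```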

impl Preprocess {
    #[instrument(ret)]
    fn search_include_paths(&mut self, filename: &str) -> Option<String> {
        if filename.starts_with("/") {
            return Some(String::from(filename));
        }

        let cached = self.include_cache.get(filename);
        if let Some(c) = cached {
            return Some(c.clone());
        }

        // Search a file from the include paths.
        for (i, p) in self.include_paths.iter().enumerate() {
            let path = format!("{}/{}", p, filename);
            if !Path::new(&path).exists() {
                continue;
            }
            self.include_cache
                .insert(String::from(filename), path.clone());
            self.include_next_idx = i + 1;
            return Some(path);
        }
        return None;
    }

    #[instrument(ret)]
    fn search_include_next(&mut self, filename: &str) -> Option<String> {
        for p in &self.include_paths[self.include_next_idx..] {
            let path = format!("{}/{}", p, filename);
            if Path::new(&path).exists() {
                return Some(path);
            }

            self.include_next_idx += 1;
        }
        return None;
    }
}
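The search logic above reduces to: absolute paths win immediately, otherwise the first include directory containing the file is used. A sketch without the cache or `include_next` bookkeeping (`search_include` is a hypothetical helper):

```rust
use std::path::Path;

// Resolve an include filename against a list of include directories.
fn search_include(paths: &[&str], filename: &str) -> Option<String> {
    if filename.starts_with('/') {
        return Some(filename.to_string());
    }
    paths
        .iter()
        .map(|p| format!("{}/{}", p, filename))
        .find(|path| Path::new(path).exists())
}
```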

impl TokenList {
    // Read an #include argument.
    #[instrument(ret)]
    fn read_include_filename(
        &mut self,
        pp: &mut Preprocess,
        tokenize: &mut Tokenize,
    ) -> (Vec<u8>, bool) {
        // Pattern 1: #include "foo.h"
        if self[0].kind == TokenKind::Str {
            // A double-quoted filename for #include is a special kind of
            // token, and we don't want to interpret any escape sequences in it.
            // For example, "\f" in "C:\foo" is not a formfeed character but
            // just two non-control characters, backslash and f.
            // So, we don't want to use token->str.
            let s = Vec::from(&self[0].loc.contents[self[0].loc.idx + 1..self[0].loc.end_idx - 1]);
            self.idx += 1;
            self.skip_line();
            return (s, true);
        }

        // Pattern 2: #include <foo.h>
        if self[0].equal("<") {
            // Reconstruct a filename from a sequence of tokens between
            // "<" and ">".
            let mut i = 0;
            while !self[i].equal(">") {
                if i + 1 >= self.len() || self[i].at_bol {
                    self[i].error("expected '>'")
                }
                i += 1;
            }
            let end = self[i].clone();
            self.idx += 1;
            let s = self.join(Some(&end));
            self.idx += i;
            self.skip_line();
            return (s.into_bytes(), false);
        }

        // Pattern 3: #include FOO
        // In this case FOO must be macro-expanded to either
        // a single string token or a sequence of "<" ... ">".
        if self[0].kind == TokenKind::Ident {
            let cl = self.copy_line();
            let mut tok2 = preprocess2(cl, pp, tokenize);
            return tok2.read_include_filename(pp, tokenize);
        }

        self[0].error("expected a filename");
        panic!("unreachable");
    }

    #[instrument(ret)]
    fn detect_include_guard(&self) -> Option<Vec<u8>> {
        let mut t = self.clone(); // TODO: expensive

        // Detect the first two lines.
        if !t[0].is_hash() || !t[1].equal("ifndef") {
            return None;
        }
        t.idx += 2;

        if t[0].kind != TokenKind::Ident {
            return None;
        }
        let m = String::from(t[0].loc.as_str());
        t.idx += 1;

        if !t[0].is_hash() || !t[1].equal("define") || !t[2].equal(&m) {
            return None;
        }

        while t.len() > 0 {
            if !t[0].is_hash() {
                t.idx += 1;
                continue;
            }

            if t.len() == 2 && t[1].equal("endif") {
                return Some(m.into_bytes());
            }

            if t[0].equal("if") || t[0].equal("ifdef") || t[0].equal("ifndef") {
                t.idx += 1;
                t.skip_cond_incl();
            } else {
                t.idx += 1;
            }
        }

        return None;
    }
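`detect_include_guard` recognizes the `#ifndef X` / `#define X` ... `#endif` pattern so a guarded header can be skipped without re-tokenizing it. A line-based sketch (the `detect_guard` helper is illustrative and skips the nested-conditional handling the real code does):

```rust
// Return the guard macro name if the file follows the classic
// #ifndef X / #define X ... #endif pattern.
fn detect_guard(lines: &[&str]) -> Option<String> {
    let first = lines.first()?.trim().strip_prefix("#ifndef ")?.trim().to_string();
    let second = lines.get(1)?.trim().strip_prefix("#define ")?.trim();
    if second != first {
        return None;
    }
    if lines.last()?.trim() == "#endif" {
        Some(first)
    } else {
        None
    }
}
```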

    #[instrument(ret)]
    fn include_file(
        self,
        path: &str,
        filename_tok: &Token,
        pp: &mut Preprocess,
        t: &mut Tokenize,
    ) -> TokenList {
        // Check for "#pragma once"
        if pp.pragma_once.contains(path) {
            return self;
        }

        // If we read the same file before, and if the file was guarded
        // by the usual #ifndef ... #endif pattern, we may be able to
        // skip the file without opening it.
        let guard_name = pp.include_guards.get(path);
        if guard_name.is_some() && pp.macros.get(guard_name.unwrap()).is_some() {
            return self;
        }

        let tok2 = t.tokenize_file(path);
        if tok2.is_none() {
            // TODO: errno?
            filename_tok.error(format!("{}: cannot open file", path).as_str())
        }
        let mut tok2 = tok2.unwrap();

        // Detect the guard in the newly included file's tokens, not in
        // the remainder of the current file.
        if let Some(guard_name) = tok2.detect_include_guard() {
            pp.include_guards
                .insert(String::from(path), String::from_utf8(guard_name).unwrap());
        }

        tok2.append(&self);
        return tok2;
    }

    #[instrument(ret)]
    fn read_line_marker(&mut self, pp: &mut Preprocess, tokenize: &mut Tokenize) {
        let start_idx = self.idx;
        let cl: TokenList = self.copy_line();
        let mut tok = pp.preprocess(cl, tokenize);

        if let TokenFields::Int(ty, val) = &tok[0].other {
            assert_eq!(tok[0].kind, TokenKind::Num);
            assert!(matches!(ty.kind, TypeKind::Int));
            let mut f = self.tokens[start_idx].file.borrow_mut();
            f.line_delta = *val as i32 - self.tokens[start_idx].line_no;

            tok.idx += 1;
            if tok.len() == 0 {
                return;
            }

            if let TokenFields::Str(_, s) = &tok[0].other {
                assert_eq!(tok[0].kind, TokenKind::Str);
                f.display_name = String::from_utf8(Vec::from(s.as_ref())).unwrap();
            } else {
                tok[0].error("filename expected");
            }
        } else {
            tok[0].error("invalid line marker");
        }
    }
}

#[instrument(ret)]
fn preprocess2(mut t: TokenList, pp: &mut Preprocess, tokenize: &mut Tokenize) -> TokenList {
    let mut ret = Vec::new();

    while t.len() > 0 {
        // If it is a macro, expand it.
        let ok: bool;
        (t, ok) = t.expand_macro(pp, tokenize);
        if ok {
            continue;
        }

        // Pass through if it is not a "#".
        if !t[0].is_hash() {
            let line_delta = t[0].file.borrow().line_delta;
            t[0].line_delta = line_delta;
            let display_name = t[0].file.borrow().display_name.clone();
            t[0].filename = Some(display_name);
            ret.push(t[0].clone()); // TODO: avoid token clones?
            t.idx += 1;
            continue;
        }

        let start: Token = t[0].clone();
        let start_next_next = t[2].clone();
        t.idx += 1;
        if t[0].equal("include") {
            t.idx += 1;
            let (filename, is_dquote) = t.read_include_filename(pp, tokenize);
            let filename = String::from_utf8(filename).unwrap();
            if !filename.starts_with("/") && is_dquote {
                let mut dirname = PathBuf::from(&start.file.borrow().name);
                dirname.pop();
                let path = format!("{}/{}", dirname.to_str().unwrap(), filename);
                if Path::new(&path).exists() {
                    t = t.include_file(&path, &start_next_next, pp, tokenize);
                    continue;
                }
            }

            let path = pp.search_include_paths(&filename);
            t = t.include_file(&path.unwrap_or(filename), &start_next_next, pp, tokenize);

            continue;
        }

        if t[0].equal("include_next") {
            t.idx += 1;
            let (filename, _) = t.read_include_filename(pp, tokenize);
            let filename = String::from_utf8(filename).unwrap();
            let path = pp.search_include_next(&filename);
            t = t.include_file(&path.unwrap_or(filename), &start_next_next, pp, tokenize);
            continue;
        }

        if t[0].equal("define") {
            t.idx += 1;
            t.read_macro_definition(pp);
            continue;
        }

        if t[0].equal("undef") {
            t.idx += 1;
            if t[0].kind != TokenKind::Ident {
                t[0].error("macro name must be an identifier");
            }
            pp.undef_macro(t[0].loc.as_str());
            t.idx += 1;
            t.skip_line();
            continue;
        }

        if t[0].equal("if") {
            let val = t.eval_const_expr(pp, tokenize);
            pp.push_cond_incl(Rc::new(start), val != 0); // why start?
            if val == 0 {
                t.skip_cond_incl();
            }
            continue;
        }

        if t[0].equal("ifdef") {
            let defined = pp.find_macro(&t[1]).is_some();
            pp.push_cond_incl(Rc::new(t[0].clone()), defined);
            t.idx += 2;
            t.skip_line();
            if !defined {
                t.skip_cond_incl();
            }
            continue;
        }

        if t[0].equal("ifndef") {
            let defined = pp.find_macro(&t[1]).is_some();
            pp.push_cond_incl(Rc::new(t[0].clone()), !defined);
            t.idx += 2;
            t.skip_line();
            if defined {
                t.skip_cond_incl();
            }
            continue;
        }

        if t[0].equal("elif") {
            if pp.cond_incl.len() == 0
                || pp.cond_incl.last().unwrap().borrow().ctx == CondInclCtx::InElse
            {
                start.error("stray #elif");
            }
            {
                let mut ci = pp.cond_incl.last().unwrap().borrow_mut();
                ci.ctx = CondInclCtx::InElif;
            }
            let included = pp.cond_incl.last().unwrap().borrow().included;
            if !included && t.eval_const_expr(pp, tokenize) > 0 {
                let mut ci = pp.cond_incl.last().unwrap().borrow_mut();
                ci.included = true;
            } else {
                t.skip_cond_incl();
            }
            continue;
        }

        if t[0].equal("else") {
            if pp.cond_incl.is_empty()
                || pp.cond_incl.last().unwrap().borrow().ctx == CondInclCtx::InElse
            {
                start.error("stray #else");
            }
            let mut ci = pp.cond_incl.last().unwrap().borrow_mut();
            ci.ctx = CondInclCtx::InElse;
            t.idx += 1;
            t.skip_line();
            if !ci.included {
                t.skip_cond_incl();
            }
            continue;
        }

        if t[0].equal("endif") {
            if pp.cond_incl.len() == 0 {
                start.error("stray #endif");
            }
            pp.cond_incl.pop();
            t.idx += 1;
            t.skip_line();
            continue;
        }

        if t[0].equal("line") {
            t.idx += 1;
            t.read_line_marker(pp, tokenize);
            continue;
        }

        if t[0].kind == TokenKind::PPNum {
            t.read_line_marker(pp, tokenize);
            continue;
        }

        if t[0].equal("pragma") && t[1].equal("once") {
            pp.pragma_once.insert(t[0].file.borrow().name.clone());
            t.idx += 2;
            t.skip_line();
            continue;
        }

        if t[0].equal("pragma") {
            loop {
                t.idx += 1;
                if t[0].at_bol {
                    break;
                }
            }
            continue;
        }

        if t[0].equal("error") {
            t[0].error("error");
        }

        // CUSTOM
        if t[0].equal("warning") {
            eprintln!("CUSTOM: skipping warning");
            t.skip_line();
            continue;
        }

        // `#`-only line is legal. It's called a null directive.
        if t[0].at_bol {
            continue;
        }

        t[0].error("invalid preprocessor directive");
    }

    return TokenList {
        tokens: ret,
        idx: 0,
    };
}
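The `#if`/`#else`/`#endif` handling in `preprocess2` amounts to a stack of booleans deciding whether ordinary tokens pass through. A sketch with conditions pre-evaluated to flags (the `filter_cond_incl` helper is a toy; it flips on `#else` rather than tracking full `#elif` chains):

```rust
// Each input line is (text, flag); flag is the condition's value for
// "#if" lines and ignored otherwise. A line survives only if every
// enclosing conditional is currently taken.
fn filter_cond_incl(lines: &[(&str, bool)]) -> Vec<String> {
    let mut stack: Vec<bool> = Vec::new();
    let mut out = Vec::new();
    for &(line, flag) in lines {
        match line {
            "#if" => stack.push(flag),
            "#else" => {
                let top = stack.last_mut().expect("stray #else");
                *top = !*top;
            }
            "#endif" => {
                stack.pop().expect("stray #endif");
            }
            _ => {
                if stack.iter().all(|&b| b) {
                    out.push(line.to_string());
                }
            }
        }
    }
    out
}
```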

impl Preprocess {
    #[instrument(ret)]
    fn define_macro(&mut self, name: &str, buf: &str) {
        let f = File::new("<built-in>".into(), 1, Vec::from(buf.as_bytes()));
        let mut t = Tokenize::new();
        let tokens = t.tokenize(Rc::new(RefCell::new(f)));
        self.add_macro(Macro {
            name: String::from(name),
            is_objlike: true,
            params: Vec::new(),
            va_args_name: None,
            body: Some(Rc::new(tokens)),
            handler: None,
        });
    }

    #[instrument(ret)]
    fn undef_macro(&mut self, name: &str) {
        self.macros.remove(name);
    }

    #[instrument(ret)]
    fn add_builtin(&mut self, name: &str, f: MacroHandlerFn) {
        let m = Macro {
            name: String::from(name),
            is_objlike: true,
            params: Vec::new(),
            va_args_name: None,
            body: None,
            handler: Some(f),
        };
        self.add_macro(m);
    }
}

impl Token {
    #[instrument(ret)]
    fn file_macro(tmpl: &Token, _: &mut Preprocess) -> Token {
        let mut tmpl = Rc::new(tmpl.clone());
        while tmpl.origin.is_some() {
            tmpl = tmpl.origin.clone().unwrap();
        }
        return Token::new_str_token(&tmpl.file.borrow().display_name, tmpl.as_ref());
    }

    #[instrument(ret)]
    fn line_macro(tmpl: &Token, _: &mut Preprocess) -> Token {
        let mut tmpl = Rc::new(tmpl.clone());
        while let Some(t) = tmpl.origin.clone() {
            tmpl = t
        }
        let i = tmpl.line_no + tmpl.file.borrow().line_delta;
        return Token::new_num_token(i, tmpl.as_ref());
    }

    // __COUNTER__ is expanded to serial values starting from 0
    #[instrument(ret)]
    fn counter_macro(tmpl: &Token, pp: &mut Preprocess) -> Token {
        let t = Token::new_num_token(pp.counter_macro_idx, tmpl);
        pp.counter_macro_idx += 1;
        return t;
    }

    // __TIMESTAMP__ is expanded to a string describing the last
    // modification time of the current file. E.g.
    // "Fri Jul 24 01:32:50 2020"
    #[instrument(ret)]
    fn timestamp_macro(tmpl: &Token, _: &mut Preprocess) -> Token {
        let r = fs::metadata(&tmpl.file.borrow().name);
        match r {
            Ok(m) => {
                let dt = DateTime::from_timestamp(m.ctime(), 0).unwrap();
                let formatted = format!("{}", dt.format("%a %b %e %H:%M:%S %Y"));
                return Token::new_str_token(&formatted, tmpl);
            }
            Err(_) => {
                return Token::new_str_token("??? ??? ?? ??:??:?? ????", tmpl);
            }
        }
    }

    #[instrument(ret)]
    fn base_file_macro(tmpl: &Token, pp: &mut Preprocess) -> Token {
        return Token::new_str_token(&pp.base_file, tmpl);
    }
}

// __DATE__ is expanded to the current date, e.g. "May 17 2020"
#[instrument(ret)]
fn format_date(dt: DateTime<Local>) -> String {
    return format!("{}", dt.format("%b %e %Y"));
}

// __TIME__ is expanded to the current time, e.g. "13:34:03"
#[instrument(ret)]
fn format_time(dt: DateTime<Local>) -> String {
    return format!("{}", dt.format("%H:%M:%S"));
}

impl Preprocess {
    pub fn init_macros(&mut self) {
        // Define predefined macros
        self.define_macro("_LP64", "1");
        self.define_macro("__C99_MACRO_WITH_VA_ARGS", "1");
        self.define_macro("__ELF__", "1");
        self.define_macro("__LP64__", "1");
        self.define_macro("__SIZEOF_DOUBLE__", "8");
        self.define_macro("__SIZEOF_FLOAT__", "4");
        self.define_macro("__SIZEOF_INT__", "4");
        self.define_macro("__SIZEOF_LONG_DOUBLE__", "8");
        self.define_macro("__SIZEOF_LONG_LONG__", "8");
        self.define_macro("__SIZEOF_LONG__", "8");
        self.define_macro("__SIZEOF_POINTER__", "8");
        self.define_macro("__SIZEOF_PTRDIFF_T__", "8");
        self.define_macro("__SIZEOF_SHORT__", "2");
        self.define_macro("__SIZEOF_SIZE_T__", "8");
        self.define_macro("__SIZE_TYPE__", "unsigned long");
        self.define_macro("__STDC_HOSTED__", "1");
        self.define_macro("__STDC_NO_COMPLEX__", "1");
        self.define_macro("__STDC_UTF_16__", "1");
        self.define_macro("__STDC_UTF_32__", "1");
        self.define_macro("__STDC_VERSION__", "201112L");
        self.define_macro("__STDC__", "1");
        self.define_macro("__USER_LABEL_PREFIX__", "");
        self.define_macro("__alignof__", "_Alignof");
        self.define_macro("__amd64", "1");
        self.define_macro("__amd64__", "1");
        self.define_macro("__chibicc__", "1");
        self.define_macro("__const__", "const");
        self.define_macro("__gnu_linux__", "1");
        self.define_macro("__inline__", "inline");
        self.define_macro("__linux", "1");
        self.define_macro("__linux__", "1");
        self.define_macro("__signed__", "signed");
        self.define_macro("__typeof__", "typeof");
        self.define_macro("__unix", "1");
        self.define_macro("__unix__", "1");
        self.define_macro("__volatile__", "volatile");
        self.define_macro("__x86_64", "1");
        self.define_macro("__x86_64__", "1");
        self.define_macro("linux", "1");
        self.define_macro("unix", "1");

        self.add_builtin("__FILE__", Token::file_macro);
        self.add_builtin("__LINE__", Token::line_macro);
        self.add_builtin("__COUNTER__", Token::counter_macro);
        self.add_builtin("__TIMESTAMP__", Token::timestamp_macro);
        self.add_builtin("__BASE_FILE__", Token::base_file_macro);

        let dt = Local::now();
        self.define_macro("__DATE__", &format_date(dt));
        self.define_macro("__TIME__", &format_time(dt));
    }
}

#[derive(Debug, PartialEq)]
enum StringKind {
    None,
    UTF8,
    UTF16,
    UTF32,
    Wide,
}

impl Token {
    #[instrument(ret)]
    fn get_string_kind(&self) -> StringKind {
        if self.loc[0] == b'u' && self.loc[1] == b'8' {
            return StringKind::UTF8;
        }

        match self.loc[0] {
            b'"' => StringKind::None,
            b'u' => StringKind::UTF16,
            b'U' => StringKind::UTF32,
            b'L' => StringKind::Wide,
            _ => panic!("unknown string literal prefix: {}", self.loc[0] as char),
        }
    }
}

impl TokenList {
    // Concatenate adjacent string literals into a single string literal
    // as per the C spec.
    #[instrument(ret)]
    fn join_adjacent_string_literals(mut self, tokenize: &Tokenize) -> TokenList {
        // First pass: If regular string literals are adjacent to
        // wide string literals, regular string literals are converted to a wide
        // type before concatenation. In this pass, we do the conversion.
        let mut i = 0;
        while i < self.len() {
            if self[i].kind != TokenKind::Str
                || i + 1 == self.len()
                || self[i + 1].kind != TokenKind::Str
            {
                i += 1;
                continue;
            }

            let mut kind = self[i].get_string_kind();

            let mut basety = if let TokenFields::Str(base_ty, _) = &self[i].other {
                base_ty.clone()
            } else {
                panic!();
            };
            for j in i + 1..self.len() {
                if self[j].kind != TokenKind::Str {
                    break;
                }
                let k = self[j].get_string_kind();
                if kind == StringKind::None {
                    kind = k;
                    basety = if let TokenFields::Str(base_ty, _) = &self[j].other {
                        base_ty.clone()
                    } else {
                        panic!();
                    };
                } else if k != StringKind::None && kind != k {
                    self[j].error("unsupported non-standard concatenation of string literals");
                }
            }

            if basety.size > 1 {
                for j in i..self.len() {
                    if self[j].kind != TokenKind::Str {
                        break;
                    }
                    if let TokenFields::Str(base_ty, _) = &self[j].other {
                        if base_ty.size == 1 {
                            self[j] = tokenize.tokenize_string_literal(
                                &self[j],
                                &basety,
                                self[j].at_bol,
                                self[j].has_space,
                            );
                        }
                    } else {
                        panic!();
                    };
                }
            }
            while i < self.len() && self[i].kind == TokenKind::Str {
                i += 1;
            }
        }

        // Second pass: concatenate adjacent string literals
        let mut ret = TokenList {
            tokens: Vec::new(),
            idx: 0,
        };
        let mut i = 0;
        while i < self.len() {
            if self[i].kind != TokenKind::Str
                || i + 1 == self.len()
                || self[i + 1].kind != TokenKind::Str
            {
                ret.tokens.push(self[i].clone());
                i += 1;
            } else {
                let mut j = i + 1;
                while j < self.len() && self[j].kind == TokenKind::Str {
                    j += 1;
                }

                let mut len = 1;
                let mut buf: Vec<u8> = Vec::new();
                for k in i..j {
                    if let TokenFields::Str(ty, s) = &self[k].other {
                        buf.extend_from_slice(s.as_ref());
                        if let TypeKind::Array { len: l, .. } = ty.kind {
                            len += l - 1;
                        } else {
                            panic!();
                        }
                    } else {
                        panic!();
                    }
                }

                let mut t = self[i].clone();
                if let TokenFields::Str(ty, _) = &self[i].other {
                    if let TypeKind::Array { base, .. } = &ty.kind {
                        t.other = TokenFields::Str(array_of(&base, len), buf.into_boxed_slice());
                    } else {
                        panic!();
                    }
                } else {
                    panic!();
                }
                ret.tokens.push(t);
                i = j;
            }
        }

        return ret;
    }
}
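The "second pass" above merges runs of adjacent string tokens into one. A minimal, self-contained sketch of that idea, with a toy `Tok` enum standing in for the real `Token`/`TokenFields` types (which also track the element type and NUL terminator):

```rust
// Toy model of the "concatenate adjacent string literals" pass above:
// runs of adjacent Str tokens are merged into one token. `Tok` is a
// stand-in for the real Token type, not the compiler's representation.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Str(String),
    Other(String),
}

fn join_adjacent(tokens: Vec<Tok>) -> Vec<Tok> {
    let mut out: Vec<Tok> = Vec::new();
    for t in tokens {
        // Try to extend the previous Str token in place.
        let merged = match (out.last_mut(), &t) {
            (Some(Tok::Str(prev)), Tok::Str(s)) => {
                prev.push_str(s);
                true
            }
            _ => false,
        };
        if !merged {
            out.push(t);
        }
    }
    out
}

fn main() {
    let toks = vec![
        Tok::Str("foo".into()),
        Tok::Str("bar".into()),
        Tok::Other(";".into()),
    ];
    assert_eq!(
        join_adjacent(toks),
        vec![Tok::Str("foobar".into()), Tok::Other(";".into())]
    );
}
```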

impl Preprocess {
    #[instrument]
    pub fn preprocess(&mut self, tokens: TokenList, tokenize: &mut Tokenize) -> TokenList {
        let mut tl = preprocess2(tokens, self, tokenize);
        if !self.cond_incl.is_empty() {
            self.cond_incl
                .last()
                .unwrap()
                .borrow()
                .tok
                .error("unterminated conditional directive");
        }
        Tokenize::convert_pp_tokens(&mut tl.tokens);
        tl = tl.join_adjacent_string_literals(tokenize);

        for t in tl.tokens.iter_mut() {
            t.line_no += t.line_delta;
        }
        return tl;
    }
}

//
// parse.c (and add_type from type.c)
//

// This file contains a recursive descent parser for C.
//
// Most functions in this file are named after the symbols they are
// supposed to read from an input token list. For example, stmt() is
// responsible for reading a statement from a token list. The function
// then constructs an AST node representing a statement.
//
// Each function conceptually returns two values, an AST node and the
// remaining part of the input tokens. In the original C version, the
// remaining tokens were returned to the caller via a pointer argument;
// in this Rust translation, TokenList keeps a cursor (`idx`) into a
// Vec of tokens instead.
//
// Unlike many recursive descent parsers, we don't have the notion of
// an "input token stream". Most parsing functions don't change the
// global state of the parser, so it is very easy to look ahead an
// arbitrary number of tokens.

// Scope for local variables, global variables, typedefs
// or enum constants
struct VarScope {
    var: Option<Rc<RefCell<Obj>>>,
    type_def: Option<Type>,
    enum_ty: Option<Type>,
    enum_val: Option<i32>,
}

// Represent a block scope.
struct Scope {
    // C has two block scopes, one is for variables/typedefs and
    // the other is for struct/union/enum tags.
    vars: HashMap<String, Rc<RefCell<VarScope>>>,
    tags: HashMap<String, Type>,
}

impl Scope {
    fn new() -> Scope {
        return Scope {
            vars: HashMap::new(),
            tags: HashMap::new(),
        };
    }
}

// Variable attributes such as typedef or extern
struct VarAttr {
    is_typedef: bool,
    is_static: bool,
    is_extern: bool,
    is_inline: bool,
    is_tls: bool,
    align: i32,
}

// This struct represents a variable initializer. Since initializers
// can be nested (e.g. `int[2][2] = {{1, 2}, {3, 4}}`), this struct
// is a tree data structure.
struct Initializer {
    ty: Type,
    tok: Option<Token>,
    is_flexible: bool,

    // If it's not an aggregate type and has an initializer,
    // `expr` has an initialization expression.
    expr: Option<Node>,

    // If it's an initializer for an aggregate type (e.g. array or struct),
    // children has initializers for its children.
    children: Option<Vec<Initializer>>,

    // Only one member can be initialized for a union.
    // `mem` is used to clarify which member is initialized.
    mem: Option<Member>,
}

// For local variable initializer.
struct InitDesg {
    idx: i32,
    member: Member,
    var: Obj,
}

struct Parse {
    // All local variable instances created during parsing are
    // accumulated to this list.
    locals: Rc<RefCell<LinkedList<Rc<RefCell<Obj>>>>>,
    // Likewise, global variables are accumulated to this list.
    globals: Rc<RefCell<LinkedList<Rc<RefCell<Obj>>>>>,
    scope: LinkedList<Scope>,
    // Points to the function object the parser is currently parsing
    current_fn: Rc<RefCell<Obj>>,
    // List of all goto statements and labels in the current function.
    gotos: LinkedList<Rc<RefCell<Node>>>,
    labels: LinkedList<Rc<RefCell<Node>>>,
    // Current "break" and "continue" jump targets.
    brk_label: Option<String>,
    cont_label: Option<String>,
    // Points to a node representing a switch if we are parsing
    // a switch statement. Otherwise, None.
    current_switch: Option<Node>,
    builtin_alloca: Rc<RefCell<Obj>>,
}

// Round up `n` to the nearest multiple of `align`. For instance,
// align_to(5, 8) returns 8 and align_to(11, 8) returns 16.
fn align_to(n: i32, align: i32) -> i32 {
    return ((n + align - 1) / align) * align;
}

fn align_down(n: i32, align: i32) -> i32 {
    return align_to(n - align + 1, align);
}
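A quick standalone sanity check of the two rounding helpers above (the functions are copied locally so the snippet compiles on its own; it assumes positive `n` and `align`, as in the compiler):

```rust
// Local copies of the rounding helpers defined above.
fn align_to(n: i32, align: i32) -> i32 {
    ((n + align - 1) / align) * align
}

// Rounds down to the nearest multiple of `align` instead.
fn align_down(n: i32, align: i32) -> i32 {
    align_to(n - align + 1, align)
}

fn main() {
    // The examples from the comment above.
    assert_eq!(align_to(5, 8), 8);
    assert_eq!(align_to(11, 8), 16);
    // Exact multiples are left unchanged.
    assert_eq!(align_to(16, 8), 16);
    assert_eq!(align_down(11, 8), 8);
    assert_eq!(align_down(16, 8), 16);
}
```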

impl Parse {
    fn enter_scope(&mut self) {
        let s = Scope::new();
        self.scope.push_front(s);
    }
    fn leave_scope(&mut self) {
        self.scope.pop_front();
    }

    // Find a variable by name.
    fn find_var(&self, tok: &Token) -> Option<Rc<RefCell<VarScope>>> {
        for sc in self.scope.iter() {
            if let Some(sc2) = sc.vars.get(tok.loc.as_str()) {
                return Some(sc2.clone());
            }
        }
        return None;
    }

    fn find_tag(&self, tok: &Token) -> Option<&Type> {
        for sc in self.scope.iter() {
            if let Some(ty) = sc.tags.get(tok.loc.as_str()) {
                return Some(ty);
            }
        }
        return None;
    }
}
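The scope chain walked by `find_var`/`find_tag` above can be sketched in isolation: entering a scope pushes a map at the front, and lookup scans front-to-back so the innermost binding shadows outer ones. (`lookup` here is a hypothetical free function over plain `i32` values, not part of the compiler.)

```rust
use std::collections::{HashMap, LinkedList};

// Innermost-scope-wins lookup over a chain of scopes.
fn lookup<'a>(scopes: &'a LinkedList<HashMap<String, i32>>, name: &str) -> Option<&'a i32> {
    scopes.iter().find_map(|s| s.get(name))
}

fn main() {
    let mut scopes: LinkedList<HashMap<String, i32>> = LinkedList::new();
    scopes.push_front(HashMap::from([("x".to_string(), 1)])); // outer scope
    scopes.push_front(HashMap::from([("x".to_string(), 2)])); // inner scope shadows x
    assert_eq!(lookup(&scopes, "x"), Some(&2));

    // Leaving the inner scope exposes the outer binding again.
    scopes.pop_front();
    assert_eq!(lookup(&scopes, "x"), Some(&1));
}
```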

impl Node {
    fn new_binary(kind: NodeBinaryKind, lhs: Node, rhs: Node, tok: Token) -> Node {
        return Node {
            ty: None,
            tok,
            kind: NodeFields::Binary {
                kind,
                lhs: Rc::new(RefCell::new(lhs)),
                rhs: Rc::new(RefCell::new(rhs)),
            },
        };
    }

    fn new_unary(kind: NodeUnaryKind, expr: Node, tok: Token) -> Node {
        return Node {
            ty: None,
            tok,
            kind: NodeFields::Unary {
                kind,
                expr: Rc::new(RefCell::new(expr)),
            },
        };
    }

    fn new_num(val: i64, tok: Token) -> Node {
        return Node {
            ty: Some(TY_INT),
            tok: tok,
            kind: NodeFields::Int(val),
        };
    }

    fn new_long(val: i64, tok: Token) -> Node {
        return Node {
            ty: Some(TY_LONG),
            tok,
            kind: NodeFields::Int(val),
        };
    }

    fn new_ulong(val: i64, tok: Token) -> Node {
        return Node {
            // TODO: chibicc sets this to ty_ulong; ty is left as None
            // in this translation.
            ty: None,
            tok,
            kind: NodeFields::Int(val),
        };
    }

    fn new_var_node(var: Rc<RefCell<Obj>>, tok: Token) -> Node {
        return Node {
            ty: None,
            tok,
            kind: NodeFields::Var(var),
        };
    }

    fn new_vla_ptr(var: Rc<RefCell<Obj>>, tok: Token) -> Node {
        return Node {
            ty: None,
            tok: tok,
            kind: NodeFields::VlaPtr(var),
        };
    }

    fn new_cast(expr: Rc<RefCell<Node>>, ty: Type) -> Node {
        expr.borrow_mut().add_type();
        let tok = expr.borrow().tok.clone();
        return Node {
            ty: Some(copy_type(&ty)),
            tok: tok,
            kind: NodeFields::Cast(expr),
        };
    }

    #[instrument(ret)]
    fn add_type(&mut self) {
        // For many binary operators, we implicitly promote operands so
        // that both operands have the same type. Any integral type smaller than
        // int is always promoted to int. If the type of one operand is larger
        // than the other's (e.g. "long" vs. "int"), the smaller operand will
        // be promoted to match with the other.
        //
        // This operation is called the "usual arithmetic conversion".
        #[instrument(ret)]
        fn usual_arith_conv(lhs: &Rc<RefCell<Node>>, rhs: &Rc<RefCell<Node>>) {
            let ty = get_common_type(
                lhs.borrow().ty.as_ref().unwrap(),
                rhs.borrow().ty.as_ref().unwrap(),
            );
            let new_lhs = Node::new_cast(lhs.clone(), ty.clone());
            let mut l = lhs.borrow_mut();
            *l = new_lhs;
            let new_rhs = Node::new_cast(rhs.clone(), ty);
            let mut r = rhs.borrow_mut();
            *r = new_rhs;
        }

        if self.ty.is_some() {
            return;
        }

        match &self.kind {
            NodeFields::NullExpr => {
                // TODO:
            }
            NodeFields::Binary { kind, lhs, rhs } => {
                // Recurse into the operands first so their types are set
                // before we read them below.
                lhs.borrow_mut().add_type();
                rhs.borrow_mut().add_type();
                match kind {
                    NodeBinaryKind::Add
                    | NodeBinaryKind::Sub
                    | NodeBinaryKind::Mul
                    | NodeBinaryKind::Div
                    | NodeBinaryKind::Mod
                    | NodeBinaryKind::BitAnd
                    | NodeBinaryKind::BitOr
                    | NodeBinaryKind::BitXor => {
                        usual_arith_conv(lhs, rhs);
                        self.ty = lhs.borrow().ty.clone();
                    }
                    NodeBinaryKind::Neg => {
                        let ty = get_common_type(&TY_INT, lhs.borrow().ty.as_ref().unwrap());
                        *lhs.borrow_mut() = Node::new_cast(lhs.clone(), ty.clone());
                        self.ty = Some(ty);
                    }
                    NodeBinaryKind::Shl | NodeBinaryKind::Shr => {
                        self.ty = lhs.borrow().ty.clone();
                    }
                    NodeBinaryKind::Eq
                    | NodeBinaryKind::Ne
                    | NodeBinaryKind::Lt
                    | NodeBinaryKind::Le => {
                        usual_arith_conv(lhs, rhs);
                        self.ty = Some(TY_INT);
                    }
                    NodeBinaryKind::Assign => {
                        if let TypeKind::Array { .. } = &lhs.borrow().ty.as_ref().unwrap().kind {
                            let tok = &lhs.borrow().tok;
                            tok.error("not an lvalue");
                        }
                        if let TypeKind::Struct { .. } = lhs.borrow().ty.as_ref().unwrap().kind {
                        } else {
                            *rhs.borrow_mut() =
                                Node::new_cast(rhs.clone(), lhs.borrow().ty.clone().unwrap());
                        }
                        self.ty = lhs.borrow().ty.clone();
                    }
                    NodeBinaryKind::Comma => {
                        self.ty = rhs.borrow().ty.clone();
                    }
                }
            }
            NodeFields::Member { member } => {
                self.ty = Some(member.ty.clone());
            }
            NodeFields::Unary { kind, expr } => {
                expr.borrow_mut().add_type();

                match kind {
                    NodeUnaryKind::Addr => {
                        let e = expr.borrow();
                        let ty = e.ty.as_ref().unwrap();
                        if let TypeKind::Array { base, .. } = &ty.kind {
                            self.ty = Some(pointer_to(base));
                        } else {
                            // TODO: what about VLA_ARRAY?
                            self.ty = Some(pointer_to(ty));
                        }
                    }
                    NodeUnaryKind::Deref => {
                        let e = expr.borrow();
                        let ty = e.ty.as_ref().unwrap();
                        match &ty.kind {
                            TypeKind::Ptr { base, .. }
                            | TypeKind::Array { base, .. }
                            | TypeKind::Vla { base, .. } => {
                                if matches!(base.kind, TypeKind::Void) {
                                    self.tok.error("dereferencing a void pointer");
                                }
                                self.ty = Some(base.as_ref().clone());
                            }
                            TypeKind::Void
                            | TypeKind::Bool
                            | TypeKind::Char
                            | TypeKind::Short
                            | TypeKind::Int
                            | TypeKind::Long
                            | TypeKind::Float
                            | TypeKind::Double
                            | TypeKind::LDouble
                            | TypeKind::Enum
                            | TypeKind::Func { .. }
                            | TypeKind::Struct { .. }
                            | TypeKind::Union { .. } => todo!(),
                        }
                    }
                    NodeUnaryKind::BitNot => {
                        self.ty = expr.borrow().ty.clone();
                    }
                    NodeUnaryKind::Not | NodeUnaryKind::LogAnd | NodeUnaryKind::LogOr => {
                        self.ty = Some(TY_INT);
                    }
                }
            }
            NodeFields::Return => {}
            NodeFields::Cond {
                kind,
                cond,
                then,
                els,
                init,
                inc,
                ..
            } => {
                cond.borrow_mut().add_type();
                then.borrow_mut().add_type();
                els.borrow_mut().add_type();
                init.borrow_mut().add_type();
                inc.borrow_mut().add_type();

                // TODO: ND_COND
                match kind {
                    NodeCondKind::Cond => {
                        if matches!(then.borrow().ty.as_ref().unwrap().kind, TypeKind::Void)
                            || matches!(els.borrow().ty.as_ref().unwrap().kind, TypeKind::Void)
                        {
                            self.ty = Some(TY_VOID);
                        } else {
                            usual_arith_conv(then, els);
                            self.ty = then.borrow().ty.clone();
                        }
                    }
                    NodeCondKind::If
                    | NodeCondKind::For
                    | NodeCondKind::Do
                    | NodeCondKind::Switch
                    | NodeCondKind::Case
                    | NodeCondKind::Goto
                    | NodeCondKind::GotoExpr
                    | NodeCondKind::Label => {}
                    NodeCondKind::LabelVal => {
                        self.ty = Some(pointer_to(&TY_VOID));
                    }
                }
            }
            NodeFields::Block => {}
            NodeFields::Label { .. } => {}
            NodeFields::FnCall { func_ty, args, .. } => {
                for n in args.iter() {
                    n.borrow_mut().add_type();
                }
                self.ty = Some(func_ty.clone());
            }
            NodeFields::Stmt { kind, body, .. } => {
                for n in body.iter() {
                    n.borrow_mut().add_type();
                }
                match kind {
                    NodeStmtKind::ExprStmt => {}
                    NodeStmtKind::StmtExpr => {
                        match body.last() {
                            Some(stmt) => {
                                let s = stmt.borrow();
                                match &s.kind {
                                    NodeFields::Stmt { kind, expr, .. } => match kind {
                                        NodeStmtKind::ExprStmt => {
                                            assert!(expr.is_some());
                                            self.ty = expr.as_ref().unwrap().borrow().ty.clone();
                                            return;
                                        }
                                        NodeStmtKind::StmtExpr => {}
                                    },
                                    _ => panic!("stmt is not a NodeFields::Stmt: {:?}", s.kind),
                                }
                            }
                            None => {}
                        }
                        self.tok
                            .error("statement expression returning void is not supported");
                    }
                }
            }
            NodeFields::Var(var) | NodeFields::VlaPtr(var) => {
                self.ty = Some(var.borrow().ty.clone());
            }
            NodeFields::Cast(_) => {
                assert!(self.ty.is_some());
            }
            NodeFields::Int(_) => {
                assert!(self.ty.is_some());
            }
            NodeFields::Float(_) => {}
            NodeFields::MemZero => {}
            NodeFields::Asm(_) => todo!(),
            NodeFields::Cas { addr, old, new } => {
                addr.borrow_mut().add_type();
                old.borrow_mut().add_type();
                new.borrow_mut().add_type();
                self.ty = Some(TY_BOOL);
                // Both operands must be pointers; report an error otherwise.
                if !matches!(
                    addr.borrow().ty.as_ref().unwrap().kind,
                    TypeKind::Ptr { .. }
                ) {
                    addr.borrow().tok.error("pointer expected");
                }
                if !matches!(old.borrow().ty.as_ref().unwrap().kind, TypeKind::Ptr { .. }) {
                    old.borrow().tok.error("pointer expected");
                }
            }
            NodeFields::Exch { lhs, rhs } => {
                // Recurse into the operands first so their types are set.
                lhs.borrow_mut().add_type();
                rhs.borrow_mut().add_type();
                match &lhs.borrow().ty.as_ref().unwrap().kind {
                    TypeKind::Ptr { base } => {
                        self.ty = Some(base.as_ref().clone());
                    }
                    _ => lhs.borrow().tok.error("pointer expected"),
                }
            }
        }

        assert!(self.ty.is_some());
    }
}
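The "usual arithmetic conversion" in `add_type` can be sketched with a toy rank-ordered type enum (not the compiler's real `Type` struct, whose `get_common_type` also handles pointers, floats, and sizes): the smaller-ranked operand is promoted to the larger-ranked type.

```rust
// Toy model of the usual arithmetic conversion: variants are declared
// in rank order, so deriving Ord gives us the promotion rule directly.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Ty {
    Int,
    Long,
    Double,
}

// The common type of two operands is the one with the larger rank.
fn common_type(a: Ty, b: Ty) -> Ty {
    a.max(b)
}

fn main() {
    assert_eq!(common_type(Ty::Int, Ty::Long), Ty::Long);
    assert_eq!(common_type(Ty::Double, Ty::Int), Ty::Double);
    assert_eq!(common_type(Ty::Int, Ty::Int), Ty::Int);
}
```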

impl Parse {
    fn push_scope(&mut self, name: String) -> Rc<RefCell<VarScope>> {
        let sc = Rc::new(RefCell::new(VarScope {
            var: None,
            type_def: None,
            enum_ty: None,
            enum_val: None,
        }));
        self.scope
            .front_mut()
            .unwrap()
            .vars
            .insert(name, sc.clone());
        return sc;
    }
}

impl Initializer {
    fn new(ty: Type, is_flexible: bool) -> Initializer {
        let mut init = Initializer {
            ty,
            tok: None,
            is_flexible: false,
            expr: None,
            children: None,
            mem: None,
        };
        match &init.ty.kind {
            TypeKind::Void
            | TypeKind::Bool
            | TypeKind::Char
            | TypeKind::Short
            | TypeKind::Int
            | TypeKind::Long
            | TypeKind::Float
            | TypeKind::Double
            | TypeKind::LDouble
            | TypeKind::Enum
            | TypeKind::Ptr { .. }
            | TypeKind::Func { .. } => {}
            TypeKind::Array { base, len } => {
                // TODO: when is ty.size < 0
                if is_flexible && init.ty.size < 0 {
                    init.is_flexible = true;
                    return init;
                }

                let mut children = Vec::with_capacity(*len);
                for _ in 0..*len {
                    children.push(Initializer::new(base.as_ref().clone(), false));
                }
                init.children = Some(children);
                return init;
            }
            TypeKind::Vla { .. } => {} // Do nothing?
            TypeKind::Struct {
                members,
                is_flexible: tyif,
                ..
            }
            | TypeKind::Union {
                members,
                is_flexible: tyif,
                ..
            } => {
                let mut children = Vec::with_capacity(members.len());
                for (i, mem) in members.iter().enumerate() {
                    if is_flexible && *tyif && i + 1 == members.len() {
                        let child = Initializer {
                            ty: mem.ty.clone(),
                            tok: None,
                            is_flexible: true,
                            expr: None,
                            children: None,
                            mem: None,
                        };
                        children.push(child);
                    } else {
                        children.push(Initializer::new(mem.ty.clone(), false));
                    }
                }
                init.children = Some(children);
                return init;
            }
        }
        return init;
    }
}
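The nested shape that `Initializer::new` builds can be illustrated with a toy tree for `int x[2][2] = {{1, 2}, {3, 4}}`: an aggregate initializer holds one child per element, a scalar one holds its expression (here just a value). `Init` and `num_children` are hypothetical, not the compiler's types.

```rust
// Toy initializer tree: scalars carry a value, aggregates carry children.
#[derive(Debug, PartialEq)]
enum Init {
    Scalar(i64),
    Array(Vec<Init>),
}

fn num_children(init: &Init) -> usize {
    match init {
        Init::Scalar(_) => 0,
        Init::Array(v) => v.len(),
    }
}

fn main() {
    // {{1, 2}, {3, 4}}
    let init = Init::Array(vec![
        Init::Array(vec![Init::Scalar(1), Init::Scalar(2)]),
        Init::Array(vec![Init::Scalar(3), Init::Scalar(4)]),
    ]);
    assert_eq!(num_children(&init), 2);
    if let Init::Array(rows) = &init {
        assert_eq!(rows[0], Init::Array(vec![Init::Scalar(1), Init::Scalar(2)]));
    }
}
```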

impl Parse {
    fn new_var(
        &mut self,
        name: String,
        ty: Type,
        next: Rc<RefCell<LinkedList<Rc<RefCell<Obj>>>>>,
    ) -> Rc<RefCell<Obj>> {
        let align = ty.align;
        let var = Obj {
            next,
            name: name.clone(),
            ty,
            tok: None,
            is_local: false,
            align,
            other: ObjFields::None,
        };
        let sc = self.push_scope(name);
        let varrc = Rc::new(RefCell::new(var));
        sc.borrow_mut().var = Some(varrc.clone());
        return varrc;
    }

    fn new_lvar(&mut self, name: String, ty: Type) -> Rc<RefCell<Obj>> {
        let var = self.new_var(name, ty, self.locals.clone());
        var.borrow_mut().is_local = true;
        let mut l = self.locals.borrow_mut();
        l.push_front(var.clone());
        return var;
    }

    fn new_gvar(&mut self, name: String, ty: Type) -> Rc<RefCell<Obj>> {
        let var = self.new_var(name, ty, self.globals.clone());
        var.borrow_mut().other = ObjFields::GVar {
            is_function: false,
            is_definition: true,
            is_static: true,
            is_tentative: false,
            is_tls: false,
            init_data: Vec::new(),
            rel: None,
        };
        let mut g = self.globals.borrow_mut();
        g.push_front(var.clone());
        return var;
    }
}


// TODO: new_unique_name.