Over the past week, I have been translating the chibicc C compiler to Rust. The C compiler is roughly 6.5k lines of code.
Translating code is repetitive and tiring, and there is little feeling of reward. So why do it? Here are the main reasons:
- I want to be able to write compiler-like tools (SQL parsers, query compilers, VSCode extensions, etc.). Languages and compilers are powerful tools. To write them, it helps to remember what code in a compiler looks like, especially code written by an experienced C++ compiler developer.
- I want to be able to write Rust code more quickly. Most databases contain 100,000-1,000,000 lines of code. To write a small database within a year, one needs to be comfortable adding >500 new lines of code every day.
  - SQLite: 134734 lines of C
  - Postgres: 840319 lines of C
  - Noria (research database): 74230 lines of Rust
So far it has been useful, but there may be better ideas. I guess I will slow down and start thinking about some other projects.
Here are some things I’ve learned so far:
- Translating 500+ lines of code a day is tiring. Will it get less tiring?
- Tracing is useful but generates too much data (1GB/run) when you instrument every function. Can a custom-built column store and dataflow engine make it easier to debug code using traces? I think it is possible.
- Compilers become complicated and error-prone because a lot of code shares the same path. If you “partition” the control-flow properly, you can reduce the complexity of the main path. However, you then deal with the problems introduced by partitioning. There are many ways to divide a codebase.
- Compilers are composed of 4 main parts: tokenizers, parsers, optimizers, and code generators. A C compiler also has a preprocessor for handling macros.
- Compilers typically scan across tokens multiple times, applying small rules at each stage. Loading source from disk in small batches and storing the tokens in a contiguous buffer should lead to good performance. The chibicc compiler stores things in linked lists for simplicity, but that may lead to more cache misses than needed.
- Parsing is a bit more complicated. I am unsure what data layout would lead to the fewest cache misses. I will need to check the literature to see whether cache misses are even a bottleneck there.
- Compilers typically have some recursive steps: recursive tokenization and, of course, recursive-descent parsing.
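The contiguous-buffer point can be sketched with a throwaway `TinyToken` type (hypothetical, not part of the translation): both containers hold the same records, but the `Vec` keeps them in one allocation while the `LinkedList` chases a pointer per node.

```rust
use std::collections::LinkedList;

// A small, fixed-size token record (hypothetical): a byte offset and a
// length into the source buffer, rather than an owned string per token.
#[derive(Clone, Copy)]
struct TinyToken {
    start: u32,
    len: u32,
}

// Contiguous storage: sequential iteration walks one allocation.
fn total_len_vec(toks: &[TinyToken]) -> u64 {
    toks.iter().map(|t| u64::from(t.len)).sum()
}

// Linked storage (what chibicc-style intrusive lists translate to): each
// node is a separate heap allocation, so iteration misses cache more often.
fn total_len_list(toks: &LinkedList<TinyToken>) -> u64 {
    toks.iter().map(|t| u64::from(t.len)).sum()
}
```

Both functions compute the same answer; the difference is purely in memory layout, which is what the cache-miss concern above is about.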
This week I hit 5000 lines. I expect there are ~5000 more lines to write.
use core::str;
use std::{
cell::{Cell, RefCell},
collections::{HashMap, HashSet, LinkedList},
ffi::{c_char, CString},
fs,
io::{stdin, Read},
ops::{Index, Range, RangeFull},
os::unix::fs::MetadataExt,
path::{Path, PathBuf},
rc::Rc,
};
use chrono::{DateTime, Local};
use libc::{isxdigit, strtod};
use tracing::{instrument, Level};
use tracing_subscriber::fmt::format::FmtSpan;
//
// Custom logging (using the tracing library)
//
#[instrument(ret)]
pub fn setup_tracing() {
let subscriber = tracing_subscriber::fmt::Subscriber::builder()
.with_span_events(FmtSpan::NEW | FmtSpan::CLOSE)
.with_ansi(false)
.with_line_number(true)
.json()
.with_span_list(false)
.finish();
tracing::subscriber::set_global_default(subscriber).unwrap();
}
//
// Custom, rust-only data structures
//
#[derive(Clone)]
struct Location {
idx: usize,
end_idx: usize,
contents: Rc<Vec<u8>>,
}
impl std::fmt::Debug for Location {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let s = self.as_str();
// Preview at most ~30 bytes, rounded up to a char boundary.
let mut i = std::cmp::min(30, s.len());
while !s.is_char_boundary(i) {
i += 1;
}
f.debug_struct("Location")
.field("idx", &self.idx)
.field("end_idx", &self.end_idx)
.field("contents[idx..end_idx]", &&s[..i])
.finish()
}
}
impl Location {
#[instrument(level = "debug")]
fn len(&self) -> usize {
return self.end_idx - self.idx;
}
#[instrument(level = "debug")]
fn as_bytes(&self) -> &[u8] {
return &self.contents[self.idx..self.end_idx];
}
#[instrument(level = "debug")]
fn as_str(&self) -> &str {
let bytes = &self.contents[self.idx..self.end_idx];
core::str::from_utf8(bytes).unwrap()
}
}
impl Index<usize> for Location {
type Output = u8;
#[instrument(level = "debug")]
fn index(&self, index: usize) -> &Self::Output {
return self.contents.index(self.idx + index);
}
}
impl Iterator for Location {
type Item = u8;
#[instrument(level = "debug", ret)]
fn next(&mut self) -> Option<Self::Item> {
if self.idx >= self.end_idx {
return None;
} else {
let ret = self.contents[self.idx];
self.idx += 1;
return Some(ret);
}
}
}
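The `Location` type above is a shared buffer plus a byte range: the Rust stand-in for chibicc's `char *` into the source text. This standalone sketch (a hypothetical `Span`, independent of `Location`) shows why cloning such a value is cheap: only the `Rc` reference count moves, never the buffer.

```rust
use std::rc::Rc;

// A shared source buffer plus a byte range into it; cloning bumps the
// refcount instead of copying the file contents.
#[derive(Clone)]
struct Span {
    buf: Rc<Vec<u8>>,
    start: usize,
    end: usize,
}

impl Span {
    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.buf[self.start..self.end]).unwrap()
    }
}
```

This is the same design choice `Location` makes: many tokens can point into one `Rc<Vec<u8>>` without lifetimes threading through every struct.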
//
// strings.rs
//
//
// tokenize.rs
//
#[derive(PartialEq, Debug, Clone)]
enum TokenKind {
Ident, // Identifiers
Punct, // Punctuators
Keyword, // Keywords
Str, // String literals
Num, // Numeric literals
PPNum, // Preprocessing numbers
}
#[derive(Clone)]
struct File {
name: String,
file_no: i32,
contents: Rc<Vec<u8>>,
display_name: String,
line_delta: i32,
}
impl std::fmt::Debug for File {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("File")
.field("name", &self.name)
.field("file_no", &self.file_no)
.field("display_name", &self.display_name)
.field("line_delta", &self.line_delta)
.finish()
}
}
#[derive(Clone, Debug)]
enum TokenFields {
None,
Int(Type, i64),
Float(Type, f64),
Str(Type, Box<[u8]>), // Can be UTF-8, UTF-16, or UTF-32.
}
#[derive(Clone)]
pub struct Token {
kind: TokenKind,
loc: Location, // Token location (equivalent to a char*)
file: Rc<RefCell<File>>,
filename: Option<String>,
line_no: i32,
line_delta: i32,
at_bol: bool, // If token is at beginning of line, true
has_space: bool, // If token follows space, true
hideset: Vec<Hideset>, // For macro expansion
origin: Option<Rc<Token>>, // If this is expanded from a macro, the original token
other: TokenFields,
}
impl std::fmt::Debug for Token {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("Token")
.field("kind", &self.kind)
.field("loc", &self.loc)
.field("other", &self.other)
.field("line_no", &self.line_no)
.field("line_delta", &self.line_delta)
.field("at_bol", &self.at_bol)
.field("has_space", &self.has_space)
.finish_non_exhaustive()
}
}
impl std::fmt::Display for Token {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
return write!(f, "Token({:?}, {})", self.kind, self.loc.as_str());
}
}
//
// preprocess.c
//
#[derive(Debug, Clone)]
struct Hideset {
name: String,
}
enum ObjFields {
None,
LVar {
// Local variable
offset: i32,
},
GVar {
// Global variable or function
is_function: bool,
is_definition: bool,
is_static: bool,
// Global variable
is_tentative: bool,
is_tls: bool,
init_data: Vec<u8>,
rel: Option<Relocation>,
},
Func {
// Global variable or function
is_function: bool,
is_definition: bool,
is_static: bool,
// Function
is_inline: bool,
params: Vec<Rc<RefCell<Obj>>>,
body: Rc<RefCell<Node>>,
locals: Vec<Rc<RefCell<Obj>>>,
va_area: Rc<RefCell<Obj>>,
alloca_bottom: Rc<RefCell<Obj>>,
stack_size: i32,
},
StackInlineFunc {
// Stack inline function
is_live: bool,
is_root: bool,
refs: Vec<String>,
},
}
//
// parse.c
//
struct Obj {
next: Rc<RefCell<LinkedList<Rc<RefCell<Obj>>>>>,
name: String,
ty: Type,
tok: Option<Token>, // Representative token
is_local: bool, // local or global/function
align: usize, // alignment
other: ObjFields,
}
// Global variable can be initialized either by a constant expression
// or a pointer to another global variable. This struct represents the
// latter.
#[derive(Debug)]
struct Relocation {
offset: i32,
label: String, // TODO: pointer?
addend: i32,
}
#[derive(Debug)]
enum NodeBinaryKind {
Add,
Sub,
Mul,
Div,
Neg,
Mod,
BitAnd, // &
BitOr, // |
BitXor, // ^
Shl, // <<
Shr, // >>
Eq, // ==
Ne, // !=
Lt,
Le,
Assign,
Comma,
}
#[derive(Debug)]
enum NodeUnaryKind {
Addr,
Deref,
Not, // !
BitNot, // ~
LogAnd, // &&
LogOr, // ||
}
#[derive(Debug)]
enum NodeStmtKind {
ExprStmt, // Expression statement
StmtExpr, // Statement expression
}
#[derive(Debug)]
enum NodeAtomicKind {
Cas,
Exch, // Atomic exchange
}
#[derive(Debug)]
enum NodeCondKind {
Cond, // ?:
If,
For,
Do,
Switch,
Case,
Goto,
GotoExpr, // "goto" labels-as-values
Label, // labeled statement
LabelVal, // Labels-as-values
}
enum NodeFields {
NullExpr, // Do nothing
Binary {
kind: NodeBinaryKind,
lhs: Rc<RefCell<Node>>,
rhs: Rc<RefCell<Node>>,
},
Member {
member: Member,
},
Unary {
kind: NodeUnaryKind,
expr: Rc<RefCell<Node>>,
},
Return,
Cond {
kind: NodeCondKind,
// if or for statement
cond: Rc<RefCell<Node>>,
then: Rc<RefCell<Node>>,
els: Rc<RefCell<Node>>,
init: Rc<RefCell<Node>>,
inc: Rc<RefCell<Node>>,
// "break" and "continue" labels
brk_label: String,
cont_label: String,
// Switch
case_next: Rc<RefCell<Node>>,
default_case: Rc<RefCell<Node>>,
// Case
begin: i32,
end: i32,
},
Block, // { .. }
Label {
// Goto or labeled statement, or labels-as-values
label: String,
unique_label: String,
goto_next: Rc<RefCell<Node>>,
},
FnCall {
// Function call
func_ty: Type,
args: Vec<Rc<RefCell<Node>>>,
pass_by_stack: bool,
ret_buffer: Rc<RefCell<Obj>>,
},
Stmt {
kind: NodeStmtKind,
expr: Option<Rc<RefCell<Node>>>,
body: Vec<Rc<RefCell<Node>>>,
},
Var(Rc<RefCell<Obj>>),
VlaPtr(Rc<RefCell<Obj>>),
Int(i64),
Float(f64),
Cast(Rc<RefCell<Node>>),
Cas {
// Atomic compare-and-swap
addr: Rc<RefCell<Node>>,
old: Rc<RefCell<Node>>,
new: Rc<RefCell<Node>>,
},
Exch {
lhs: Rc<RefCell<Node>>,
rhs: Rc<RefCell<Node>>,
},
MemZero,
Asm(String),
}
impl std::fmt::Debug for NodeFields {
// Child nodes are deliberately omitted from the output to keep trace
// records small; only kinds and scalar payloads are printed.
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::NullExpr => write!(f, "NullExpr"),
Self::Binary { kind, .. } => f.debug_struct("Binary").field("kind", kind).finish(),
Self::Member { member } => f.debug_struct("Member").field("member", member).finish(),
Self::Unary { kind, .. } => f.debug_struct("Unary").field("kind", kind).finish(),
Self::Return => write!(f, "Return"),
Self::Cond { kind, .. } => f.debug_struct("Cond").field("kind", kind).finish(),
Self::Block => write!(f, "Block"),
Self::Label { .. } => f.debug_struct("Label").finish(),
Self::FnCall {
func_ty,
pass_by_stack,
..
} => f
.debug_struct("FnCall")
.field("func_ty", func_ty)
.field("pass_by_stack", pass_by_stack)
.finish(),
Self::Stmt { kind, .. } => f.debug_struct("Stmt").field("kind", kind).finish(),
Self::Var(..) => f.debug_tuple("Var").finish(),
Self::VlaPtr(..) => f.debug_tuple("VlaPtr").finish(),
Self::Int(arg0) => f.debug_tuple("Int").field(arg0).finish(),
Self::Float(arg0) => f.debug_tuple("Float").field(arg0).finish(),
Self::Cast(arg0) => f.debug_tuple("Cast").field(arg0).finish(),
Self::Cas { .. } => f.debug_struct("Cas").finish(),
Self::Exch { .. } => f.debug_struct("Exch").finish(),
Self::MemZero => write!(f, "MemZero"),
Self::Asm(arg0) => f.debug_tuple("Asm").field(arg0).finish(),
}
}
}
#[derive(Debug)]
struct Node {
kind: NodeFields,
ty: Option<Type>,
tok: Token, // Representative token
}
//
// type.rs
//
#[derive(Clone, Debug)]
enum TypeKind {
Void,
Bool,
Char,
Short,
Int,
Long,
Float,
Double,
LDouble,
Enum,
Ptr {
base: Box<Type>,
},
Func {
return_ty: Box<Type>,
params: Vec<Type>,
is_variadic: bool,
next: Option<Box<Type>>,
},
Array {
base: Box<Type>,
len: usize,
},
Vla {
base: Box<Type>,
vla_len: usize,
vla_size: Option<usize>,
}, // Variable-length array
Struct {
members: Vec<Member>,
is_flexible: bool,
is_packed: bool,
},
Union {
members: Vec<Member>,
is_flexible: bool,
},
}
#[derive(Clone)]
struct OneOrManyTypes {
ty: Vec<Type>,
}
#[derive(Clone)]
struct TypeDeclaration {
name: Box<Token>,
name_pos: Box<Token>,
}
#[derive(Clone)]
struct Type {
kind: TypeKind,
size: usize, // sizeof() value
align: usize,
is_unsigned: bool,
is_atomic: bool,
origin: Option<Rc<Type>>, // for type compatibility check
decl: Option<TypeDeclaration>,
}
impl std::fmt::Debug for Type {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("Type")
.field("kind", &self.kind)
.field("size", &self.size)
.field("align", &self.align)
.field("is_unsigned", &self.is_unsigned)
.finish_non_exhaustive()
}
}
#[derive(Clone, Debug)]
struct Member {
ty: Type,
tok: Token,
name: Token,
idx: i32,
align: i32,
offset: i32,
is_bitfield: bool,
bit_offset: i32,
bit_width: i32,
}
#[derive(Debug)]
pub struct Tokenize {
current_file: Option<Rc<RefCell<File>>>, // Input file
input_files: Vec<Rc<RefCell<File>>>, // A list of all input files
}
impl Tokenize {
#[instrument(ret)]
pub fn new() -> Self {
Tokenize {
current_file: None,
input_files: Vec::new(),
}
}
}
// type.c
const TY_VOID: Type = new_type(TypeKind::Void, 1, 1, false);
const TY_BOOL: Type = new_type(TypeKind::Bool, 1, 1, false);
const TY_CHAR: Type = new_type(TypeKind::Char, 1, 1, false);
const TY_SHORT: Type = new_type(TypeKind::Short, 2, 2, false);
const TY_INT: Type = new_type(TypeKind::Int, 4, 4, false);
const TY_LONG: Type = new_type(TypeKind::Long, 8, 8, false);
const TY_UCHAR: Type = new_type(TypeKind::Char, 1, 1, true);
const TY_USHORT: Type = new_type(TypeKind::Short, 2, 2, true);
const TY_UINT: Type = new_type(TypeKind::Int, 4, 4, true);
const TY_ULONG: Type = new_type(TypeKind::Long, 8, 8, true);
const TY_FLOAT: Type = new_type(TypeKind::Float, 4, 4, false);
const TY_DOUBLE: Type = new_type(TypeKind::Double, 8, 8, false);
const TY_LDOUBLE: Type = new_type(TypeKind::LDouble, 16, 16, false);
const fn new_type(kind: TypeKind, size: usize, align: usize, is_unsigned: bool) -> Type {
return Type {
kind,
size,
align,
is_unsigned,
is_atomic: false,
origin: None,
decl: None,
};
}
#[instrument(ret)]
fn is_integer(ty: &Type) -> bool {
match ty.kind {
TypeKind::Bool | TypeKind::Char | TypeKind::Short | TypeKind::Int | TypeKind::Long => true,
TypeKind::Enum => true,
TypeKind::Void
| TypeKind::Float
| TypeKind::Double
| TypeKind::LDouble
| TypeKind::Ptr { .. }
| TypeKind::Func { .. }
| TypeKind::Array { .. }
| TypeKind::Vla { .. }
| TypeKind::Struct { .. }
| TypeKind::Union { .. } => false,
}
}
#[instrument(ret)]
fn is_flonum(ty: &Type) -> bool {
match ty.kind {
TypeKind::Float | TypeKind::Double | TypeKind::LDouble => true,
TypeKind::Void
| TypeKind::Bool
| TypeKind::Char
| TypeKind::Short
| TypeKind::Int
| TypeKind::Long
| TypeKind::Enum
| TypeKind::Ptr { .. }
| TypeKind::Func { .. }
| TypeKind::Array { .. }
| TypeKind::Vla { .. }
| TypeKind::Struct { .. }
| TypeKind::Union { .. } => false,
}
}
#[instrument(ret)]
fn is_numeric(ty: &Type) -> bool {
return is_integer(ty) || is_flonum(ty);
}
#[instrument(ret)]
fn is_compatible(t1: &Type, t2: &Type) -> bool {
if std::ptr::eq(t1, t2) {
return true;
}
if let Some(o) = &t1.origin {
return is_compatible(&o, t2);
}
if let Some(o) = &t2.origin {
return is_compatible(t1, &o);
}
if std::mem::discriminant(&t1.kind) != std::mem::discriminant(&t2.kind) {
return false;
}
match &t1.kind {
TypeKind::Void => return false,
TypeKind::Bool => return false,
TypeKind::Char | TypeKind::Short | TypeKind::Int | TypeKind::Long => {
return t1.is_unsigned == t2.is_unsigned;
}
TypeKind::Float | TypeKind::Double | TypeKind::LDouble => {
return true;
}
TypeKind::Enum => return false,
TypeKind::Ptr { base: base1 } => {
let base2 = match &t2.kind {
TypeKind::Void
| TypeKind::Bool
| TypeKind::Char
| TypeKind::Short
| TypeKind::Int
| TypeKind::Long
| TypeKind::Float
| TypeKind::Double
| TypeKind::LDouble
| TypeKind::Enum => {
return false;
}
TypeKind::Ptr { base } => base,
TypeKind::Func { .. } => {
return false;
}
TypeKind::Array { base, .. } => base,
TypeKind::Vla { base, .. } => base,
TypeKind::Struct { .. } => {
return false;
}
TypeKind::Union { .. } => {
return false;
}
};
return is_compatible(&base1, &base2);
}
TypeKind::Func {
return_ty: return_ty1,
params: params1,
is_variadic: is_variadic1,
next: _,
} => {
if let TypeKind::Func {
return_ty: return_ty2,
params: params2,
is_variadic: is_variadic2,
next: _,
} = &t2.kind
{
if !is_compatible(&return_ty1, &return_ty2) {
return false;
}
if is_variadic1 != is_variadic2 {
return false;
}
if params1.len() != params2.len() {
return false;
}
for (p1, p2) in params1.iter().zip(params2.iter()) {
if !is_compatible(&p1, &p2) {
return false;
}
}
return true;
} else {
return false;
}
}
TypeKind::Array { .. } => {
// TODO: chibicc also treats two arrays of unknown length with
// compatible element types as compatible; for now, report false.
return false;
}
// TODO: can these types be compatible?
TypeKind::Vla { .. } => return false,
TypeKind::Struct { .. } => return false,
TypeKind::Union { .. } => return false,
}
}
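The first real check in `is_compatible` compares enum variants with `std::mem::discriminant`, the Rust analogue of chibicc comparing `t1->kind` and `t2->kind` as integers. A minimal sketch (hypothetical `Kind` enum) shows that the payload is ignored by this pre-check:

```rust
use std::mem::discriminant;

#[derive(Debug)]
enum Kind {
    Int { unsigned: bool },
    Ptr { depth: u8 },
}

// True when both values are the same enum variant, regardless of the
// data carried inside; field-by-field comparison happens afterwards.
fn same_variant(a: &Kind, b: &Kind) -> bool {
    discriminant(a) == discriminant(b)
}
```

Two `Int`s with different signedness still pass this stage; signedness is checked in the per-kind match that follows.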
#[instrument(ret)]
fn copy_type(ty: &Type) -> Type {
let mut ret = ty.clone();
ret.origin = Some(Rc::new(ty.clone()));
return ret;
}
#[instrument(ret)]
fn pointer_to(base: &Type) -> Type {
return new_type(
TypeKind::Ptr {
base: Box::new(base.clone()),
},
8,
8,
true,
);
}
#[instrument(ret)]
fn func_type(return_ty: &Type) -> Type {
return new_type(
TypeKind::Func {
return_ty: Box::new(return_ty.clone()),
params: Vec::new(),
is_variadic: false,
next: None,
},
1,
1,
false,
);
}
#[instrument(ret)]
fn array_of(base: &Type, len: usize) -> Type {
return new_type(
TypeKind::Array {
base: Box::new(base.clone()),
len: len,
},
base.size * len,
base.align,
false,
);
}
#[instrument(ret)]
fn enum_type() -> Type {
return new_type(TypeKind::Enum, 4, 4, false);
}
#[instrument(ret)]
fn struct_type() -> Type {
return new_type(
TypeKind::Struct {
members: Vec::new(),
is_flexible: false,
is_packed: false,
},
0,
1,
false,
);
}
#[instrument(ret)]
fn get_common_type(ty1: &Type, ty2: &Type) -> Type {
match &ty1.kind {
TypeKind::Ptr { base, .. } => {
return pointer_to(&base);
}
TypeKind::Array { base, .. } => {
return pointer_to(&base);
}
TypeKind::Vla { base, .. } => {
return pointer_to(&base);
}
TypeKind::Void
| TypeKind::Bool
| TypeKind::Char
| TypeKind::Short
| TypeKind::Int
| TypeKind::Long
| TypeKind::Float
| TypeKind::Double
| TypeKind::LDouble
| TypeKind::Enum
| TypeKind::Func { .. }
| TypeKind::Struct { .. }
| TypeKind::Union { .. } => {}
}
if let TypeKind::Func { .. } = &ty1.kind {
return pointer_to(ty1);
}
if let TypeKind::Func { .. } = &ty2.kind {
return pointer_to(ty2);
}
if matches!(ty1.kind, TypeKind::LDouble) || matches!(ty2.kind, TypeKind::LDouble) {
return TY_LDOUBLE;
}
if matches!(ty1.kind, TypeKind::Double) || matches!(ty2.kind, TypeKind::Double) {
return TY_DOUBLE;
}
if matches!(ty1.kind, TypeKind::Float) || matches!(ty2.kind, TypeKind::Float) {
return TY_FLOAT;
}
let mut ty1m: Type = ty1.clone();
let mut ty2m: Type = ty2.clone();
if ty1m.size < 4 {
ty1m = TY_INT;
}
if ty2m.size < 4 {
ty2m = TY_INT;
}
if ty1m.size != ty2m.size {
if ty1m.size < ty2m.size {
return ty2m;
} else {
return ty1m;
}
}
if ty2m.is_unsigned {
return ty2m;
}
return ty1m;
}
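`get_common_type` implements C's usual arithmetic conversions. Setting aside the floating-point and pointer cases, the integer core reduces to three rules, sketched here with a hypothetical `IntTy` (not the `Type` above): promote anything smaller than `int`, pick the wider operand, and let unsigned win a size tie.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct IntTy {
    size: usize, // in bytes
    unsigned: bool,
}

const INT: IntTy = IntTy { size: 4, unsigned: false };

fn common_int_type(a: IntTy, b: IntTy) -> IntTy {
    // Rule 1: integer promotion to at least `int`.
    let a = if a.size < 4 { INT } else { a };
    let b = if b.size < 4 { INT } else { b };
    // Rule 2: different sizes -> the wider type wins.
    if a.size != b.size {
        return if a.size < b.size { b } else { a };
    }
    // Rule 3: same size -> unsigned wins the tie.
    if b.unsigned { b } else { a }
}
```

This mirrors the tail of `get_common_type` above: the `TY_INT` promotion, the size comparison, and the final `is_unsigned` tiebreak.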
// add_type is implemented in the parse.c section
#[cfg(test)]
mod test_type {
use super::{is_compatible, TY_CHAR, TY_LONG, TY_ULONG};
#[test]
fn test_is_compatible() {
let cases = [((TY_CHAR, TY_CHAR), true), ((TY_ULONG, TY_LONG), false)];
for ((t1, t2), e) in cases {
let a = is_compatible(&t1, &t2);
assert_eq!(a, e);
}
}
}
// unicode.c
#[instrument(ret, level = "debug")]
fn encode_utf8(buf: &mut [u8], c: u32) -> usize {
#[instrument(ret)]
fn one_zero_6_bits(c: u32) -> u8 {
return 0b10000000 | (c & 0b111111) as u8;
}
if c <= 0x7F {
buf[0] = c as u8;
return 1;
} else if c <= 0x7FF {
buf[0] = 0b11000000 | (c >> 6) as u8;
buf[1] = one_zero_6_bits(c);
return 2;
} else if c <= 0xFFFF {
buf[0] = 0b11100000 | (c >> 12) as u8;
buf[1] = one_zero_6_bits(c >> 6);
buf[2] = one_zero_6_bits(c);
return 3;
} else if c <= 0x10FFFF {
buf[0] = 0b11110000 | (c >> 18) as u8;
buf[1] = one_zero_6_bits(c >> 12);
buf[2] = one_zero_6_bits(c >> 6);
buf[3] = one_zero_6_bits(c);
return 4;
} else {
// TODO: error?
todo!();
}
}
#[instrument(ret, level = "debug")]
fn decode_utf8(b: &[u8]) -> Result<(u32, &[u8]), &str> {
// TODO: handle out of bounds errors?
if b[0] <= 0x7F {
return Ok((b[0] as u32, &b[1..]));
} else {
let (first_bits, len) = {
if b[0] >= 0b11110000 {
(b[0] & 0b111, 4)
} else if b[0] >= 0b11100000 {
(b[0] & 0b1111, 3)
} else if b[0] >= 0b11000000 {
(b[0] & 0b11111, 2)
} else {
// TODO: bubble up to display the token which triggered the error?
return Err("invalid UTF-8 sequence");
}
};
let mut c = first_bits as u32;
for i in 1..len {
if (b[i] & 0b11000000) != 0b10000000 {
// TODO: bubble up to display the token which triggered the error?
return Err("invalid UTF-8 sequence");
}
c = (c << 6) | (b[i] & 0b111111) as u32
}
Ok((c, &b[len..]))
}
}
#[cfg(test)]
mod test_utf8 {
use super::{decode_utf8, encode_utf8};
#[test]
fn utf8() {
let points: Vec<u32> = vec![0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF];
for p in points {
let mut buffer = [0u8; 4];
encode_utf8(&mut buffer, p);
let r = decode_utf8(&buffer);
assert!(r.is_ok());
let (point_result, new) = r.unwrap();
assert_eq!(point_result, p);
assert!(new.len() < 4);
}
}
}
#[instrument(ret, level = "debug")]
fn in_range(range: &[(u32, u32)], c: u32) -> bool {
for (low, high) in range {
if *low <= c && c <= *high {
return true;
}
}
return false;
}
// [https://www.sigbus.info/n1570#D] C11 allows not only ASCII
// but some multibyte characters in certain Unicode ranges to be used in
// an identifier.
//
// This function returns true if a given character is acceptable as the
// first character of an identifier.
#[instrument(ret, level = "debug")]
fn is_ident1(c: u32) -> bool {
let range = [
('_' as u32, '_' as u32),
('a' as u32, 'z' as u32),
('A' as u32, 'Z' as u32),
('$' as u32, '$' as u32),
(0x00A8, 0x00A8),
(0x00AA, 0x00AA),
(0x00AD, 0x00AD),
(0x00AF, 0x00AF),
(0x00B2, 0x00B5),
(0x00B7, 0x00BA),
(0x00BC, 0x00BE),
(0x00C0, 0x00D6),
(0x00D8, 0x00F6),
(0x00F8, 0x00FF),
(0x0100, 0x02FF),
(0x0370, 0x167F),
(0x1681, 0x180D),
(0x180F, 0x1DBF),
(0x1E00, 0x1FFF),
(0x200B, 0x200D),
(0x202A, 0x202E),
(0x203F, 0x2040),
(0x2054, 0x2054),
(0x2060, 0x206F),
(0x2070, 0x20CF),
(0x2100, 0x218F),
(0x2460, 0x24FF),
(0x2776, 0x2793),
(0x2C00, 0x2DFF),
(0x2E80, 0x2FFF),
(0x3004, 0x3007),
(0x3021, 0x302F),
(0x3031, 0x303F),
(0x3040, 0xD7FF),
(0xF900, 0xFD3D),
(0xFD40, 0xFDCF),
(0xFDF0, 0xFE1F),
(0xFE30, 0xFE44),
(0xFE47, 0xFFFD),
(0x10000, 0x1FFFD),
(0x20000, 0x2FFFD),
(0x30000, 0x3FFFD),
(0x40000, 0x4FFFD),
(0x50000, 0x5FFFD),
(0x60000, 0x6FFFD),
(0x70000, 0x7FFFD),
(0x80000, 0x8FFFD),
(0x90000, 0x9FFFD),
(0xA0000, 0xAFFFD),
(0xB0000, 0xBFFFD),
(0xC0000, 0xCFFFD),
(0xD0000, 0xDFFFD),
(0xE0000, 0xEFFFD),
];
return in_range(&range, c);
}
// Return true if a given character is acceptable as a non-first
// character of an identifier.
#[instrument(ret, level = "debug")]
fn is_ident2(c: u32) -> bool {
let range = [
('0' as u32, '9' as u32),
('$' as u32, '$' as u32),
(0x0300, 0x036F),
(0x1DC0, 0x1DFF),
(0x20D0, 0x20FF),
(0xFE20, 0xFE2F),
];
return is_ident1(c) || in_range(&range, c);
}
// Returns the number of columns needed to display a given
// character in a fixed-width font.
//
// Based on https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c.
#[instrument(ret)]
fn char_width(c: u32) -> usize {
let range1 = [
(0x0000, 0x001F),
(0x007f, 0x00a0),
(0x0300, 0x036F),
(0x0483, 0x0486),
(0x0488, 0x0489),
(0x0591, 0x05BD),
(0x05BF, 0x05BF),
(0x05C1, 0x05C2),
(0x05C4, 0x05C5),
(0x05C7, 0x05C7),
(0x0600, 0x0603),
(0x0610, 0x0615),
(0x064B, 0x065E),
(0x0670, 0x0670),
(0x06D6, 0x06E4),
(0x06E7, 0x06E8),
(0x06EA, 0x06ED),
(0x070F, 0x070F),
(0x0711, 0x0711),
(0x0730, 0x074A),
(0x07A6, 0x07B0),
(0x07EB, 0x07F3),
(0x0901, 0x0902),
(0x093C, 0x093C),
(0x0941, 0x0948),
(0x094D, 0x094D),
(0x0951, 0x0954),
(0x0962, 0x0963),
(0x0981, 0x0981),
(0x09BC, 0x09BC),
(0x09C1, 0x09C4),
(0x09CD, 0x09CD),
(0x09E2, 0x09E3),
(0x0A01, 0x0A02),
(0x0A3C, 0x0A3C),
(0x0A41, 0x0A42),
(0x0A47, 0x0A48),
(0x0A4B, 0x0A4D),
(0x0A70, 0x0A71),
(0x0A81, 0x0A82),
(0x0ABC, 0x0ABC),
(0x0AC1, 0x0AC5),
(0x0AC7, 0x0AC8),
(0x0ACD, 0x0ACD),
(0x0AE2, 0x0AE3),
(0x0B01, 0x0B01),
(0x0B3C, 0x0B3C),
(0x0B3F, 0x0B3F),
(0x0B41, 0x0B43),
(0x0B4D, 0x0B4D),
(0x0B56, 0x0B56),
(0x0B82, 0x0B82),
(0x0BC0, 0x0BC0),
(0x0BCD, 0x0BCD),
(0x0C3E, 0x0C40),
(0x0C46, 0x0C48),
(0x0C4A, 0x0C4D),
(0x0C55, 0x0C56),
(0x0CBC, 0x0CBC),
(0x0CBF, 0x0CBF),
(0x0CC6, 0x0CC6),
(0x0CCC, 0x0CCD),
(0x0CE2, 0x0CE3),
(0x0D41, 0x0D43),
(0x0D4D, 0x0D4D),
(0x0DCA, 0x0DCA),
(0x0DD2, 0x0DD4),
(0x0DD6, 0x0DD6),
(0x0E31, 0x0E31),
(0x0E34, 0x0E3A),
(0x0E47, 0x0E4E),
(0x0EB1, 0x0EB1),
(0x0EB4, 0x0EB9),
(0x0EBB, 0x0EBC),
(0x0EC8, 0x0ECD),
(0x0F18, 0x0F19),
(0x0F35, 0x0F35),
(0x0F37, 0x0F37),
(0x0F39, 0x0F39),
(0x0F71, 0x0F7E),
(0x0F80, 0x0F84),
(0x0F86, 0x0F87),
(0x0F90, 0x0F97),
(0x0F99, 0x0FBC),
(0x0FC6, 0x0FC6),
(0x102D, 0x1030),
(0x1032, 0x1032),
(0x1036, 0x1037),
(0x1039, 0x1039),
(0x1058, 0x1059),
(0x1160, 0x11FF),
(0x135F, 0x135F),
(0x1712, 0x1714),
(0x1732, 0x1734),
(0x1752, 0x1753),
(0x1772, 0x1773),
(0x17B4, 0x17B5),
(0x17B7, 0x17BD),
(0x17C6, 0x17C6),
(0x17C9, 0x17D3),
(0x17DD, 0x17DD),
(0x180B, 0x180D),
(0x18A9, 0x18A9),
(0x1920, 0x1922),
(0x1927, 0x1928),
(0x1932, 0x1932),
(0x1939, 0x193B),
(0x1A17, 0x1A18),
(0x1B00, 0x1B03),
(0x1B34, 0x1B34),
(0x1B36, 0x1B3A),
(0x1B3C, 0x1B3C),
(0x1B42, 0x1B42),
(0x1B6B, 0x1B73),
(0x1DC0, 0x1DCA),
(0x1DFE, 0x1DFF),
(0x200B, 0x200F),
(0x202A, 0x202E),
(0x2060, 0x2063),
(0x206A, 0x206F),
(0x20D0, 0x20EF),
(0x302A, 0x302F),
(0x3099, 0x309A),
(0xA806, 0xA806),
(0xA80B, 0xA80B),
(0xA825, 0xA826),
(0xFB1E, 0xFB1E),
(0xFE00, 0xFE0F),
(0xFE20, 0xFE23),
(0xFEFF, 0xFEFF),
(0xFFF9, 0xFFFB),
(0x10A01, 0x10A03),
(0x10A05, 0x10A06),
(0x10A0C, 0x10A0F),
(0x10A38, 0x10A3A),
(0x10A3F, 0x10A3F),
(0x1D167, 0x1D169),
(0x1D173, 0x1D182),
(0x1D185, 0x1D18B),
(0x1D1AA, 0x1D1AD),
(0x1D242, 0x1D244),
(0xE0001, 0xE0001),
(0xE0020, 0xE007F),
(0xE0100, 0xE01EF),
];
if in_range(&range1, c) {
return 0;
}
let range2 = [
(0x1100, 0x115F),
(0x2329, 0x2329),
(0x232A, 0x232A),
(0x2E80, 0x303E),
(0x3040, 0xA4CF),
(0xAC00, 0xD7A3),
(0xF900, 0xFAFF),
(0xFE10, 0xFE19),
(0xFE30, 0xFE6F),
(0xFF00, 0xFF60),
(0xFFE0, 0xFFE6),
(0x1F000, 0x1F644),
(0x20000, 0x2FFFD),
(0x30000, 0x3FFFD),
];
if in_range(&range2, c) {
return 2;
}
// Everything else occupies one column.
return 1;
}
#[instrument(ret)]
fn display_width(p: &str) -> usize {
let mut w = 0;
for c in p.chars() {
w += char_width(c as u32);
}
return w;
}
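The width logic can be exercised end to end with a cut-down table (a hypothetical `width` with only a few ranges from the full tables above): combining marks are 0 columns, East Asian wide characters are 2, everything else is 1.

```rust
// Zero-width combining marks and double-width CJK ranges; a tiny subset
// of the full wcwidth-style tables.
fn width(c: u32) -> usize {
    const ZERO: [(u32, u32); 1] = [(0x0300, 0x036F)];
    const WIDE: [(u32, u32); 3] = [(0x1100, 0x115F), (0x3040, 0xA4CF), (0xAC00, 0xD7A3)];
    if ZERO.iter().any(|&(lo, hi)| lo <= c && c <= hi) {
        return 0;
    }
    if WIDE.iter().any(|&(lo, hi)| lo <= c && c <= hi) {
        return 2;
    }
    1 // default: one column, not zero
}

fn cols(s: &str) -> usize {
    s.chars().map(|c| width(c as u32)).sum()
}
```

Getting the default of 1 right matters downstream: the error caret is positioned by summing these widths.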
//
// tokenize.c
//
#[instrument(ret)]
fn error(filename: &str, line_no: i32, loc: &Location, msg: &str) {
let contents = loc.contents.as_ref();
let line_idx = {
let mut idx = loc.idx;
while idx > 0 && contents[idx - 1] != b'\n' {
idx -= 1;
}
idx
};
let end_idx = {
let mut idx = loc.idx;
while idx < contents.len() && contents[idx] != b'\n' {
idx += 1;
}
idx
};
// Print out the line.
let line = core::str::from_utf8(&contents[line_idx..end_idx]).unwrap();
let header = format!("{}:{}: ", filename, line_no);
let indent = header.len();
eprint!("{}{}\n", header, line);
// Show the error message below the token.
let spaces = indent + {
let line_to_loc = core::str::from_utf8(&contents[line_idx..loc.idx]).unwrap();
display_width(line_to_loc)
};
eprint!("{}^ {}\n", " ".repeat(spaces), msg);
}
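`error` prints a two-line, caret-style diagnostic. A self-contained sketch of the same layout (a hypothetical `render_error` that uses byte columns, so it is only ASCII-safe, unlike the `display_width`-based version above):

```rust
// Build "<line>\n<spaces>^ <msg>": the offending source line with a
// caret under the byte offset of the error.
fn render_error(src: &str, at: usize, msg: &str) -> String {
    let line_start = src[..at].rfind('\n').map_or(0, |i| i + 1);
    let line_end = src[at..].find('\n').map_or(src.len(), |i| at + i);
    let line = &src[line_start..line_end];
    let col = at - line_start; // byte column; the real code measures display width
    format!("{}\n{}^ {}", line, " ".repeat(col), msg)
}
```

Returning a `String` instead of writing to stderr also makes this layout trivially testable.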
impl Tokenize {
#[instrument(ret)]
fn error(&self, loc: &Location, msg: &str) {
let line_no = {
let mut n = 1;
for (i, b) in loc.contents.iter().enumerate() {
if i == loc.idx {
break;
}
if *b == b'\n' {
n += 1;
}
}
n
};
error(
self.current_file.as_ref().unwrap().borrow().name.as_str(),
line_no,
loc,
msg,
);
panic!(); // TODO: replace with process exit?
}
}
impl Token {
#[instrument(ret)]
fn error(&self, msg: &str) {
let file = &self.file;
error(&file.borrow().name, self.line_no, &self.loc, msg);
panic!(); // TODO: replace with process exit?
}
#[instrument(ret)]
fn warn(&self, msg: &str) {
let file = &self.file;
error(&file.borrow().name, self.line_no, &self.loc, msg);
}
}
#[cfg(test)]
mod test_errors {
#[test]
fn errors() {
// TODO:
}
}
impl Token {
#[instrument(ret, level = Level::DEBUG)]
pub fn new(
kind: TokenKind,
current_file: Rc<RefCell<File>>,
loc: Location,
at_bol: bool,
has_space: bool,
) -> Self {
Token {
kind,
loc,
file: current_file,
filename: None,
line_no: -1, // TODO: uninitialized?
line_delta: 0,
at_bol: at_bol,
has_space: has_space,
hideset: Vec::new(),
origin: None,
other: TokenFields::None,
}
}
}
#[instrument(ret, level = "debug")]
fn read_ident(start: &[u8]) -> usize {
let (c, mut next) = decode_utf8(start).unwrap(); // TODO: unwrap
if !is_ident1(c) {
return 0;
}
while next.len() > 0 {
let (c, nextnext) = decode_utf8(next).unwrap(); // TODO:
if !is_ident2(c) {
break;
}
next = nextnext;
}
return start.len() - next.len();
}
#[cfg(test)]
mod test_read_ident {
use super::read_ident;
#[test]
fn test_read_ident() {
let idents = [("asdfadsf1", 9), ("123adsfas", 0)];
for (s, elen) in idents {
let alen = read_ident(s.as_bytes());
assert_eq!(elen, alen);
}
}
}
#[instrument(ret)]
fn from_hex(c: u8) -> u8 {
if b'0' <= c && c <= b'9' {
return c - b'0';
} else if b'a' <= c && c <= b'f' {
return c - b'a' + 10;
} else {
return c - b'A' + 10;
}
}
#[instrument(ret, level = "debug")]
fn read_punct(p: &[u8]) -> usize {
let kw = [
"<<=", ">>=", "...", "==", "!=", "<=", ">=", "->", "+=", "-=", "*=", "/=", "++", "--",
"%=", "&=", "|=", "^=", "&&", "||", "<<", ">>", "##",
];
for w in kw {
if p.starts_with(w.as_bytes()) {
return w.len();
}
}
return unsafe {
// TODO: is p[0] a u8?
if libc::ispunct(p[0] as libc::c_int) > 0 {
1
} else {
0
}
};
}
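`read_punct` relies on maximal munch: the table is ordered longest first, so `<<=` is claimed as one token before `<<` or `<` get a chance. A trimmed sketch (a hypothetical `punct_len` with a three-entry subset of the table):

```rust
// Longest-match-first table; reversing the order would mis-tokenize
// "<<=" as "<<" followed by "=".
fn punct_len(p: &[u8]) -> usize {
    const KW: [&str; 3] = ["<<=", "<<", "<"];
    for w in KW {
        if p.starts_with(w.as_bytes()) {
            return w.len();
        }
    }
    0
}
```

The real function falls back to `libc::ispunct` for single-character punctuators instead of returning 0.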
impl Token {
#[instrument(ret, level = Level::DEBUG)]
fn is_keyword(&self) -> bool {
let set: HashSet<&str> = HashSet::from([
"return",
"if",
"else",
"for",
"while",
"int",
"sizeof",
"char",
"struct",
"union",
"short",
"long",
"void",
"typedef",
"_Bool",
"enum",
"static",
"goto",
"break",
"continue",
"switch",
"case",
"default",
"extern",
"_Alignof",
"_Alignas",
"do",
"signed",
"unsigned",
"const",
"volatile",
"auto",
"register",
"restrict",
"__restrict",
"__restrict__",
"_Noreturn",
"float",
"double",
"typeof",
"asm",
"_Thread_local",
"__thread",
"_Atomic",
"__attribute__",
]);
return set.contains(self.loc.as_str());
}
}
impl Tokenize {
// Escape sequences are defined using themselves here.
// '\n' is implemented using '\n'. This tautological definition
// works because the compiler that compiles our compiler knows what
// '\n' actually is. In other words, we "inherit" the ASCII code of '\n'
// from the compiler that compiles our compiler, so we don't have to teach
// the actual code here.
//
// This fact has huge implications not only for the correctness
// of the compiler but also for the security of the generated code.
// For more info, read "Reflections on Trusting Trust" by Ken Thompson
// https://github.com/rui314/chibicc/wiki/thompson1984.pdf
#[instrument(ret)]
fn read_escaped_char(&self, p: &mut Location) -> u32 {
if b'0' <= p[0] && p[0] <= b'7' {
// Read an octal number.
let mut c = (p[0] - b'0') as u32;
let mut i: usize = 1;
p.next();
while i < 3 {
if b'0' <= p[0] && p[0] <= b'7' {
c <<= 3;
c += (p[0] - b'0') as u32;
p.next();
} else {
break;
}
i += 1;
}
return c;
} else if p[0] == b'x' {
// Read a hexadecimal number.
p.next();
if unsafe { libc::isxdigit(p[0] as i32) == 0 } {
// TODO: preemptively replace with Location?
self.error(p, "invalid hex escape sequence");
}
let mut c: u32 = 0;
while p.len() > 0 && unsafe { libc::isxdigit(p[0] as i32) > 0 } {
c <<= 4;
c += from_hex(p[0]) as u32;
p.next();
}
return c;
} else {
let c = match p[0] {
b'a' => 7 as u32,
b'b' => 8 as u32,
b't' => b'\t' as u32,
b'n' => b'\n' as u32,
b'v' => 11 as u32,
b'f' => 12 as u32,
b'r' => b'\r' as u32,
b'e' => 27,
_ => p[0] as u32,
};
p.next();
return c;
}
}
#[instrument(ret)]
fn string_literal_end(&self, loc: &Location) -> Location {
let mut end = loc.clone();
while let Some(c) = end.next() {
if c == b'"' {
break;
}
if c == b'\n' || c == b'\0' {
self.error(loc, "unclosed string literal");
}
if c == b'\\' {
end.next();
}
}
end.idx -= 1;
assert!(end[0] == b'"');
return end;
}
#[instrument(ret)]
fn read_string_literal(
&self,
start: &Location,
quote: &Location,
at_bol: bool,
has_space: bool,
) -> Token {
let mut q = quote.clone();
q.next();
let end = self.string_literal_end(&q);
q.end_idx = end.idx;
let mut buf: Vec<u8> = Vec::with_capacity(end.idx - quote.idx);
while let Some(c) = q.next() {
if c == b'\\' {
let ec: u32 = self.read_escaped_char(&mut q);
buf.push(ec as u8); // TODO: unsafe cast?
} else {
buf.push(c);
}
}
let mut tok = Token::new(
TokenKind::Str,
self.current_file.clone().unwrap(),
Location {
idx: start.idx,
end_idx: end.idx + 1,
contents: end.contents,
},
at_bol,
has_space,
);
tok.other = TokenFields::Str(array_of(&TY_CHAR, buf.len() + 1), buf.into_boxed_slice());
return tok;
}
    // Read a UTF-8-encoded string literal and transcode it to UTF-16.
    //
    // UTF-16 is yet another variable-width encoding for Unicode. Code
    // points smaller than U+10000 are encoded in 2 bytes. Code points
    // equal to or larger than that are encoded in 4 bytes. Each 2-byte
    // unit in the 4-byte sequence is called a "surrogate", and the pair
    // of surrogates is called a "surrogate pair".
#[instrument(ret)]
fn read_utf16_string_literal(
&self,
start: &Location,
quote: &Location,
at_bol: bool,
has_space: bool,
) -> Token {
let mut q = quote.clone();
q.next();
let end = self.string_literal_end(&q);
q.end_idx = end.idx;
let mut buf = Vec::with_capacity(end.idx - start.idx);
while q.idx < end.idx {
if q[0] == b'\\' {
q.next();
let ec = self.read_escaped_char(&mut q);
buf.push(ec as u16); // TODO: unsafe cast?
} else {
let p = q.as_bytes();
let (cd, next_p) = decode_utf8(p).unwrap();
q.idx += p.len() - next_p.len();
if cd < 0x10000 {
// Encode a code point in 2 bytes.
buf.push(cd as u16);
} else {
// Encode a code point in 4 bytes.
let cs = cd - 0x10000;
let b1 = 0xd800 + ((cs >> 10) & 0x3ff);
let b2 = 0xdc00 + (cs & 0x3ff);
buf.push(b1 as u16);
buf.push(b2 as u16);
}
}
}
let mut tok = Token::new(
TokenKind::Str,
self.current_file.clone().unwrap(),
Location {
idx: start.idx,
end_idx: end.idx + 1,
contents: q.contents,
},
at_bol,
has_space,
);
tok.other = TokenFields::Str(array_of(&TY_USHORT, buf.len() + 1), unsafe {
std::mem::transmute(buf.into_boxed_slice())
});
return tok;
}
    // Read a UTF-8-encoded string literal and transcode it to UTF-32.
//
// UTF-32 is a fixed-width encoding for Unicode. Each code point is
// encoded in 4 bytes.
#[instrument(ret)]
fn read_utf32_string_literal(
&self,
start: &Location,
quote: &Location,
ty: &Type,
at_bol: bool,
has_space: bool,
) -> Token {
let mut q = quote.clone();
q.next();
let end = self.string_literal_end(&q);
q.end_idx = end.idx;
let mut buf: Vec<u32> = Vec::with_capacity(end.idx - start.idx); // Why quote.idx instead of start in original code?
while q.idx < end.idx {
if q[0] == b'\\' {
q.next();
let ec = self.read_escaped_char(&mut q);
buf.push(ec);
} else {
let p = q.as_bytes();
let (dc, next_p) = decode_utf8(p).unwrap();
q.idx += p.len() - next_p.len();
buf.push(dc);
}
}
let mut tok = Token::new(
TokenKind::Str,
self.current_file.clone().unwrap(),
Location {
idx: start.idx,
end_idx: end.idx + 1,
contents: q.contents,
},
at_bol,
has_space,
);
tok.other = TokenFields::Str(array_of(ty, buf.len() + 1), unsafe {
std::mem::transmute(buf.into_boxed_slice())
});
return tok;
}
#[instrument(ret)]
fn read_char_literal(
&self,
start: &Location,
quote: &Location,
ty: &Type,
at_bol: bool,
has_space: bool,
) -> Token {
let mut q = quote.clone();
q.next();
let first_literal_byte = q[0];
if first_literal_byte == b'\0' {
self.error(start, "unclosed char literal");
}
let c32 = {
if first_literal_byte == b'\\' {
q.next();
self.read_escaped_char(&mut q)
} else {
let p = q.as_bytes();
let (c32, nextp) = decode_utf8(p).unwrap();
q.idx += p.len() - nextp.len();
c32
}
};
if q.find(|c| *c == b'\'').is_none() {
self.error(start, "unclosed char literal");
}
let end = q;
let mut tok = Token::new(
TokenKind::Num,
self.current_file.clone().unwrap(),
Location {
idx: start.idx,
                end_idx: end.idx, // end.idx is one past the closing quote
contents: end.contents,
},
at_bol,
has_space,
);
tok.other = TokenFields::Int(ty.clone(), c32 as i64);
return tok;
}
}
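The surrogate-pair arithmetic in `read_utf16_string_literal` can be cross-checked against the standard library. Here is a minimal standalone sketch of the same encoding step (`encode_utf16_cp` is a name I made up for illustration; it is not part of the compiler):

```rust
// Encode one Unicode code point as UTF-16, mirroring the arithmetic in
// read_utf16_string_literal: BMP code points become a single 16-bit
// unit; larger ones become a high/low surrogate pair.
fn encode_utf16_cp(cp: u32) -> Vec<u16> {
    if cp < 0x10000 {
        vec![cp as u16]
    } else {
        let cs = cp - 0x10000;
        let hi = 0xd800 + ((cs >> 10) & 0x3ff); // top 10 bits
        let lo = 0xdc00 + (cs & 0x3ff); // bottom 10 bits
        vec![hi as u16, lo as u16]
    }
}
```

For U+1F363 (🍣) this produces `[0xD83C, 0xDF63]`, which agrees with `char::encode_utf16`.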
impl Token {
#[instrument(ret)]
fn convert_pp_int(&mut self) -> bool {
        // TODO: functionality is different because p stops at loc.end_idx
        // while the original code may look at addresses after loc.end_idx
let mut p = self.loc.as_bytes();
// Read a binary, octal, decimal, or hexadecimal number.
let mut base = 10;
        if p.len() > 2
            && p[..2].eq_ignore_ascii_case("0x".as_bytes())
            && unsafe { isxdigit(p[2] as i32) > 0 }
        {
            p = &p[2..];
            base = 16;
        } else if p.len() > 2
            && p[..2].eq_ignore_ascii_case("0b".as_bytes())
            && (p[2] == b'0' || p[2] == b'1')
{
p = &p[2..];
base = 2;
} else if p[0] == b'0' {
base = 8;
}
let mut pint = &p[..];
let mut flag = false;
for i in 0..p.len() {
if p[i].eq_ignore_ascii_case(&b'l') || p[i].eq_ignore_ascii_case(&b'u') {
pint = &p[..i];
p = &p[i..];
flag = true;
break;
}
}
if !flag {
p = &p[p.len()..];
}
let (val, l, u) = match u64::from_str_radix(core::str::from_utf8(&pint).unwrap(), base) {
Ok(val_unsigned) => {
let val = val_unsigned as i64;
if p.starts_with("LLU".as_bytes())
|| p.starts_with("LLu".as_bytes())
|| p.starts_with("llU".as_bytes())
|| p.starts_with("llu".as_bytes())
|| p.starts_with("ULL".as_bytes())
|| p.starts_with("Ull".as_bytes())
|| p.starts_with("uLL".as_bytes())
|| p.starts_with("ull".as_bytes())
{
p = &p[3..];
(val, true, true)
} else if p.len() >= 2
&& (p[..2].eq_ignore_ascii_case("lu".as_bytes())
|| p[..2].eq_ignore_ascii_case("ul".as_bytes()))
{
p = &p[2..];
(val, true, true)
} else if p.starts_with("LL".as_bytes()) || p.starts_with("ll".as_bytes()) {
p = &p[2..];
(val, true, false)
} else if p.len() > 0 && p[0].eq_ignore_ascii_case(&b'l') {
p = &p[1..];
(val, true, false)
} else if p.len() > 0 && p[0].eq_ignore_ascii_case(&b'u') {
p = &p[1..];
(val, false, true)
} else {
(val, false, false)
}
}
Err(_) => {
return false;
}
};
if p.len() != 0 {
return false;
}
let ty: Type = {
if base == 10 {
            if l && u {
TY_ULONG
} else if l {
TY_LONG
} else if u {
if val >> 32 != 0 {
TY_ULONG
} else {
TY_UINT
}
} else {
if val >> 31 != 0 {
TY_LONG
} else {
TY_INT
}
}
} else {
            if l && u {
TY_ULONG
} else if l {
if val >> 63 != 0 {
TY_ULONG
} else {
TY_LONG
}
} else if u {
if val >> 32 != 0 {
TY_ULONG
} else {
TY_UINT
}
} else if val >> 63 != 0 {
TY_ULONG
} else if val >> 32 != 0 {
TY_LONG
} else if val >> 31 != 0 {
TY_UINT
} else {
TY_INT
}
}
};
self.kind = TokenKind::Num;
self.other = TokenFields::Int(ty, val);
return true;
}
    // The definition of a numeric literal at the preprocessing stage
    // is more relaxed than its definition at the later stages.
// In order to handle that, a numeric literal is tokenized as a
// "pp-number" token first and then converted to a regular number
// token after preprocessing.
//
// This function converts a pp-number token to a regular number token.
#[instrument(ret)]
fn convert_pp_number(&mut self) {
if self.convert_pp_int() {
return;
}
let (val, bytes_read) = unsafe {
let s = CString::new(self.loc.as_bytes()).unwrap();
let start = s.as_ptr();
let end: Cell<*mut c_char> = Cell::new(std::ptr::null_mut());
let val = strtod(s.as_ptr(), end.as_ptr());
let len = end.get().byte_offset_from(start);
(val, len as usize)
};
let mut end = &self.loc.as_bytes()[bytes_read..];
let (ty, val) = {
if end.len() > 0 && end[0].eq_ignore_ascii_case(&b'f') {
end = &end[1..];
(TY_FLOAT, val)
} else if end.len() > 0 && end[0].eq_ignore_ascii_case(&b'l') {
end = &end[1..];
(TY_LDOUBLE, val)
} else {
(TY_DOUBLE, val)
}
};
if end.len() != 0 {
self.error("invalid numeric constant");
}
self.kind = TokenKind::Num;
self.other = TokenFields::Float(ty, val);
}
}
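The long `if` chain in `convert_pp_int` implements the type-selection table for integer constants from C11 6.4.4.1: decimal constants without a `U` suffix never silently become unsigned, while octal and hex constants may. A compact standalone sketch of the same decision table (type names only; `int_const_type` is a hypothetical helper, not part of the compiler):

```rust
// Pick the C type of an integer constant from its value, base, and
// L/U suffixes, mirroring the branches in convert_pp_int.
fn int_const_type(val: u64, decimal: bool, l: bool, u: bool) -> &'static str {
    let fits = |bits: u32| val >> bits == 0;
    if l && u {
        "unsigned long"
    } else if l {
        // A decimal constant with an L suffix stays signed.
        if decimal || fits(63) { "long" } else { "unsigned long" }
    } else if u {
        if fits(32) { "unsigned int" } else { "unsigned long" }
    } else if decimal {
        if fits(31) { "int" } else { "long" }
    } else if fits(31) {
        "int"
    } else if fits(32) {
        "unsigned int"
    } else if fits(63) {
        "long"
    } else {
        "unsigned long"
    }
}
```

For example, `2147483648` is `long` when written in decimal but `unsigned int` when written as `0x80000000`.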
impl Tokenize {
    // We take in a vector of tokens instead of a linked list.
#[instrument(skip_all)]
fn convert_pp_tokens(tokens: &mut Vec<Token>) {
for t in tokens {
if t.is_keyword() {
t.kind = TokenKind::Keyword;
} else if t.kind == TokenKind::PPNum {
t.convert_pp_number();
}
}
}
// Initialize line info for all tokens.
//
// Note: we take a vector, and assume the tokens are sorted.
#[instrument(skip_all)]
fn add_line_number(&self, tokens: &mut Vec<Token>) {
let p: &[u8] = &self.current_file.as_ref().unwrap().borrow().contents;
let mut n = 1;
let mut token_idx = 0;
for (i, c) in p.into_iter().enumerate() {
            if token_idx == tokens.len() {
                // All tokens have been assigned a line number.
                break;
            }
if i == tokens[token_idx].loc.idx {
tokens[token_idx].line_no = n;
token_idx += 1;
}
if *c == b'\n' {
n += 1;
}
}
}
}
#[cfg(test)]
mod test_read {
use std::{cell::RefCell, rc::Rc};
use super::{File, Location, Tokenize};
#[test]
fn test_read_escaped_character() {
let cases = [
            ("044", 36), // octal
("xFF", 255), // hex
("\n", '\n' as u32),
            ("u", b'u' as u32), // unknown escapes map to themselves
];
let t = fake_tokenize();
for (s, e) in cases {
let mut loc = Location {
idx: 0,
end_idx: s.len(),
contents: Rc::new(s.into()),
};
let a = t.read_escaped_char(&mut loc);
assert_eq!(loc.idx, loc.contents.len());
assert_eq!(a, e);
}
}
fn fake_tokenize() -> Tokenize {
let fake_file = File {
name: "fake".into(),
file_no: -1,
contents: Rc::new("fake".into()),
display_name: "fake".into(),
line_delta: 0,
};
let t = Tokenize {
current_file: Some(Rc::new(RefCell::new(fake_file))),
input_files: Vec::new(),
};
t
}
}
impl Tokenize {
#[instrument(ret)]
fn tokenize_string_literal(
&self,
tok: &Token,
basety: &Type,
at_bol: bool,
has_space: bool,
) -> Token {
let t = if basety.size == 2 {
self.read_utf16_string_literal(&tok.loc, &tok.loc, at_bol, has_space)
} else {
self.read_utf32_string_literal(&tok.loc, &tok.loc, basety, at_bol, has_space)
};
return t;
}
#[instrument(ret)]
fn tokenize(&mut self, file: Rc<RefCell<File>>) -> TokenList {
self.current_file = Some(file.clone());
let mut loc = Location {
idx: 0,
end_idx: file.borrow().contents.len(),
contents: file.borrow().contents.clone(),
};
let mut tokens: Vec<Token> = Vec::new();
let mut at_bol = true;
let mut has_space: bool = false;
while loc.idx < loc.end_idx {
// Skip line comments.
if loc.as_bytes().starts_with("//".as_bytes()) {
loc.idx += 2;
while loc[0] != b'\n' {
loc.idx += 1;
}
has_space = true;
}
// Skip block comments
else if loc.as_bytes().starts_with("/*".as_bytes()) {
#[instrument(ret, level = "debug")]
                fn strstr(s: &[u8], m: &[u8]) -> Option<usize> {
                    if s.len() < m.len() {
                        return None;
                    }
                    for i in 0..=s.len() - m.len() {
                        if &s[i..i + m.len()] == m {
                            return Some(i);
                        }
                    }
                    return None;
                }
let q = strstr(&loc.as_bytes()[2..], "*/".as_bytes());
match q {
Some(i) => {
loc.idx += 2 + i;
loc.idx += 2;
has_space = true
}
None => {
self.error(&loc, "unclosed block comment");
return TokenList {
tokens: Vec::new(),
idx: 0,
};
}
}
}
// Skip newline.
else if loc[0] == b'\n' {
loc.idx += 1;
at_bol = true;
has_space = false;
}
// Skip whitespace characters.
else if loc[0].is_ascii_whitespace() {
loc.idx += 1;
has_space = true;
} else {
let at_bol_copy = at_bol;
let has_space_copy = has_space;
let mut push_token = |tok| {
tokens.push(tok);
at_bol = false;
has_space = false;
};
// Numeric literal
                if loc[0].is_ascii_digit() || (loc[0] == b'.' && loc.len() > 1 && loc[1].is_ascii_digit()) {
let start: usize = loc.idx;
loc.idx += 1;
loop {
if loc.len() >= 2
&& loc[0] != 0
&& loc[1] != 0
&& "eEpP".as_bytes().contains(&loc[0])
&& "+-".as_bytes().contains(&loc[1])
{
loc.idx += 2;
} else if loc.len() > 0
&& (loc[0].is_ascii_alphanumeric() || loc[0] == b'.')
{
loc.idx += 1;
} else {
break;
}
}
push_token(Token::new(
TokenKind::PPNum,
self.current_file.clone().unwrap(),
Location {
idx: start,
end_idx: loc.idx,
contents: loc.contents.clone(),
},
at_bol_copy,
has_space_copy,
));
}
// String literal
else if loc[0] == b'"' {
let tok = self.read_string_literal(&loc, &loc, at_bol_copy, has_space_copy);
loc.idx = tok.loc.end_idx;
push_token(tok);
}
// UTF-8 string literal
else if loc.as_bytes().starts_with("u8\"".as_bytes()) {
let mut quote = loc.clone();
quote.idx += 2;
                    let tok = self.read_string_literal(&loc, &quote, at_bol_copy, has_space_copy);
loc.idx = tok.loc.end_idx;
push_token(tok);
}
// UTF-16 string literal
else if loc.as_bytes().starts_with("u\"".as_bytes()) {
let mut quote = loc.clone();
quote.idx += 1;
let tok =
                        self.read_utf16_string_literal(&loc, &quote, at_bol_copy, has_space_copy);
loc.idx = tok.loc.end_idx;
push_token(tok);
}
// Wide string literal
else if loc.as_bytes().starts_with("L\"".as_bytes()) {
let mut quote = loc.clone();
quote.idx += 1;
let tok = self.read_utf32_string_literal(
&loc,
                        &quote,
&TY_INT,
at_bol_copy,
has_space_copy,
);
loc.idx = tok.loc.end_idx;
push_token(tok);
}
// UTF-32 string literal
else if loc.as_bytes().starts_with("U\"".as_bytes()) {
let mut quote = loc.clone();
quote.idx += 1;
let tok = self.read_utf32_string_literal(
&loc,
                        &quote,
&TY_UINT,
at_bol_copy,
has_space_copy,
);
loc.idx = tok.loc.end_idx;
push_token(tok);
}
// Character literal
else if loc[0] == b'\'' {
let tok =
self.read_char_literal(&loc, &loc, &TY_INT, at_bol_copy, has_space_copy);
// TODO: what is the meaning of cur->val = (char) cur->val;
loc.idx = tok.loc.end_idx;
push_token(tok);
}
                // UTF-16 character literal
else if loc.as_bytes().starts_with("u'".as_bytes()) {
let mut quote = loc.clone();
quote.idx += 1;
let tok = self.read_char_literal(
&loc,
                        &quote,
&TY_USHORT,
at_bol_copy,
has_space_copy,
);
loc.idx = tok.loc.end_idx;
push_token(tok);
}
// Wide character literal
else if loc.as_bytes().starts_with("L'".as_bytes()) {
let mut quote = loc.clone();
quote.idx += 1;
let tok =
                        self.read_char_literal(&loc, &quote, &TY_INT, at_bol_copy, has_space_copy);
loc.idx = tok.loc.end_idx;
push_token(tok);
}
// UTF-32 character literal
else if loc.as_bytes().starts_with("U'".as_bytes()) {
let mut quote = loc.clone();
quote.idx += 1;
let tok: Token =
                        self.read_char_literal(&loc, &quote, &TY_UINT, at_bol_copy, has_space_copy);
loc.idx = tok.loc.end_idx;
push_token(tok);
}
// Identifier or keyword
else {
let ident_len = read_ident(&loc.as_bytes());
if ident_len > 0 {
let tok = Token::new(
TokenKind::Ident,
self.current_file.clone().unwrap(),
Location {
idx: loc.idx,
end_idx: loc.idx + ident_len,
contents: loc.contents.clone(),
},
at_bol_copy,
has_space_copy,
);
loc.idx = tok.loc.end_idx;
push_token(tok);
continue;
}
let punct_len = read_punct(loc.as_bytes());
if punct_len > 0 {
let tok = Token::new(
TokenKind::Punct,
self.current_file.clone().unwrap(),
Location {
idx: loc.idx,
end_idx: loc.idx + punct_len,
contents: loc.contents.clone(),
},
at_bol_copy,
has_space_copy,
);
loc.idx = tok.loc.end_idx;
push_token(tok);
continue;
}
self.error(&loc, "invalid token");
}
}
}
self.add_line_number(&mut tokens);
return TokenList {
            tokens,
idx: 0,
};
}
}
#[instrument]
fn read_file(path: &str) -> Option<Vec<u8>> {
let mut fp: Box<dyn Read>;
if path == "-" {
fp = Box::new(stdin());
} else {
match std::fs::File::open(path) {
Ok(bf) => {
if bf.metadata().unwrap().is_dir() {
return None;
}
fp = Box::new(bf);
}
Err(_) => {
return None;
}
}
}
let mut buf = Vec::new();
// Read the entire file.
loop {
let mut buf2: [u8; 4096] = [0; 4096];
let n = fp.read(&mut buf2).unwrap();
if n == 0 {
break;
}
buf.extend_from_slice(&buf2[..n]);
}
// Make sure that the last line is properly terminated with '\n'.
if buf.len() == 0 || *buf.last().unwrap() != b'\n' {
buf.push(b'\n');
}
return Some(buf);
}
impl Tokenize {
#[instrument(ret)]
fn get_input_files(&self) -> Vec<Rc<RefCell<File>>> {
return self.input_files.clone();
}
}
impl File {
#[instrument(ret, skip(contents))]
fn new(name: String, file_no: i32, contents: Vec<u8>) -> Self {
let display_name = name.clone();
return File {
name,
file_no,
contents: Rc::new(contents),
            display_name,
line_delta: 0,
};
}
}
// Replaces \r or \r\n with \n.
#[instrument(skip_all)]
fn canonicalize_newline(p: &mut Vec<u8>) {
let mut i = 0;
let mut j = 0;
while i < p.len() {
        if p[i] == b'\r' && i + 1 < p.len() && p[i + 1] == b'\n' {
p[j] = b'\n';
i += 2;
j += 1;
} else if p[i] == b'\r' {
p[j] = b'\n';
i += 1;
j += 1;
} else {
p[j] = p[i];
i += 1;
j += 1;
}
}
p.truncate(j);
}
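A quick property check for the newline canonicalization: mixed `\r\n`, bare `\r`, and `\n` all collapse to `\n`, shrinking the buffer in place. This is a standalone copy of the loop for testing, with an explicit bounds check on the two-byte lookahead:

```rust
// Rewrite \r\n and \r to \n in place (standalone copy of
// canonicalize_newline for testing).
fn canonicalize_newline(p: &mut Vec<u8>) {
    let (mut i, mut j) = (0, 0);
    while i < p.len() {
        if p[i] == b'\r' && i + 1 < p.len() && p[i + 1] == b'\n' {
            p[j] = b'\n';
            i += 2;
        } else if p[i] == b'\r' {
            p[j] = b'\n';
            i += 1;
        } else {
            p[j] = p[i];
            i += 1;
        }
        j += 1;
    }
    p.truncate(j);
}
```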
// Removes backslashes followed by a newline.
#[instrument(skip_all)]
fn remove_backslash_newline(p: &mut Vec<u8>) {
let mut i = 0;
let mut j = 0;
// We want to keep the number of newline characters so that
// the logical line number matches the physical one.
// This counter maintains the number of newlines we have removed.
let mut n = 0;
while i < p.len() {
        if p[i] == b'\\' && i + 1 < p.len() && p[i + 1] == b'\n' {
i += 2;
n += 1;
} else if p[i] == b'\n' {
p[j] = b'\n';
i += 1;
j += 1;
while n > 0 {
p[j] = b'\n';
j += 1;
n -= 1;
}
} else {
p[j] = p[i];
i += 1;
j += 1;
}
}
while n > 0 {
p[j] = b'\n';
j += 1;
n -= 1;
}
p.truncate(j);
}
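The deferred-newline bookkeeping above is what keeps diagnostics honest: splicing `a\` + newline + `b` yields `ab`, but the removed newline is re-emitted at the next line end, so physical and logical line numbers stay in sync. A standalone sketch of the same loop:

```rust
// Remove backslash-newline pairs, re-emitting each deferred newline at
// the next line end so the total newline count is preserved (mirrors
// remove_backslash_newline).
fn splice_lines(p: &mut Vec<u8>) {
    let (mut i, mut j, mut n) = (0, 0, 0);
    while i < p.len() {
        if p[i] == b'\\' && i + 1 < p.len() && p[i + 1] == b'\n' {
            i += 2;
            n += 1; // defer this newline
        } else if p[i] == b'\n' {
            p[j] = b'\n';
            i += 1;
            j += 1;
            while n > 0 {
                p[j] = b'\n';
                j += 1;
                n -= 1;
            }
        } else {
            p[j] = p[i];
            i += 1;
            j += 1;
        }
    }
    p.truncate(j);
}
```

Note how the input and output contain the same number of `\n` bytes even though the continuation was removed.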
#[instrument(ret, level = "debug")]
fn read_universal_char(p: &[u8], len: usize) -> u32 {
    if p.len() < len {
        // Not enough digits left; treat as an invalid sequence.
        return 0;
    }
    let mut c: u32 = 0;
    for i in 0..len {
if unsafe { isxdigit(p[i] as i32) == 0 } {
return 0;
}
c = (c << 4) | (from_hex(p[i]) as u32);
}
return c;
}
// Replace \u or \U escape sequences with corresponding UTF-8 bytes
#[instrument(skip_all)]
fn convert_universal_chars(p: &mut Vec<u8>) {
let mut i = 0;
let mut j = 0;
while i < p.len() {
if p[i..].starts_with("\\u".as_bytes()) {
let c = read_universal_char(&p[i + 2..], 4);
if c > 0 {
i += 6;
j += encode_utf8(&mut p[j..], c);
} else {
p[j] = p[i];
i += 1;
j += 1;
}
} else if p[i..].starts_with("\\U".as_bytes()) {
let c = read_universal_char(&p[i + 2..], 8);
if c > 0 {
i += 10;
j += encode_utf8(&mut p[j..], c);
            } else {
                p[j] = p[i];
                i += 1;
                j += 1;
            }
        } else if p[i] == b'\\' {
p[j] = p[i];
i += 1;
j += 1;
p[j] = p[i];
i += 1;
j += 1;
} else {
p[j] = p[i];
i += 1;
j += 1;
}
}
p.truncate(j);
}
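As a sanity check on the `\uXXXX` replacement: U+03B1 (α) should expand to the two UTF-8 bytes `0xCE 0xB1`. The standard library can serve as a reference encoder (`ucn_to_utf8` is a made-up name for this sketch):

```rust
// Convert the hex digits of a universal character name to the UTF-8
// bytes of the corresponding code point, using std as the encoder.
fn ucn_to_utf8(hex: &str) -> Vec<u8> {
    let cp = u32::from_str_radix(hex, 16).unwrap();
    let mut buf = [0u8; 4];
    char::from_u32(cp)
        .unwrap()
        .encode_utf8(&mut buf)
        .as_bytes()
        .to_vec()
}
```

A `\u` escape always supplies 4 hex digits and a `\U` escape 8, which is why `convert_universal_chars` advances by 6 and 10 bytes respectively.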
impl Tokenize {
#[instrument]
pub fn tokenize_file(&mut self, path: &str) -> Option<TokenList> {
        let mut contents = read_file(path)?;
// UTF-8 texts may start with a 3-byte "BOM" marker sequence.
        // If it exists, just skip those bytes because they are useless.
        // (It is actually not recommended to add a BOM marker to UTF-8
        // texts, but it's not uncommon, particularly on Windows.)
        if contents.len() >= 3 && contents[..3].eq(&[0xef, 0xbb, 0xbf]) {
            contents.drain(..3);
}
canonicalize_newline(&mut contents);
remove_backslash_newline(&mut contents);
convert_universal_chars(&mut contents);
// Save the filename for assembler .file directive.
let file = Rc::new(RefCell::new(File::new(
path.into(),
self.input_files.len() as i32,
contents,
)));
self.input_files.push(file.clone());
return Some(self.tokenize(file));
}
}
#[cfg(test)]
mod test_tokenize {
use std::{cell::RefCell, fmt::Write, rc::Rc};
use crate::chibicc::chibicc::{read_punct, setup_tracing};
use super::{File, Tokenize};
#[test]
fn test_read_punc() {
let s = r#"<= 0x7F) {
buf[0] ="#;
let len = read_punct(s.as_bytes());
assert_eq!(len, 2);
}
#[test]
fn test_tokenize() {
setup_tracing();
// Contents is a string that hits most branches
// of tokenize.
let contents = r#"
// Line Comment
/*
Block comment
*/
// Numbers
12345;
123.45;
10e+7;
0xFF;
0b11110000;
// String literals
strcmp("αβγ", "\u03B1\u03B2\u03B3");
strcmp("日本語", "\u65E5\u672C\u8A9E");
"u8\"a\"";
strcmp(u8"abc", "abc");
({ char16_t x[] = u"αβ"; x[1]; }))
memcmp(L"🍣", "c\363\001\0\0\0\0\0", 8);
sizeof(L"\xffzzz");
memcmp(U"abc", "a\0\0\0b\0\0\0c\0\0\0\0\0\0\0", 16);
memcmp(U"🍣", "c\363\001\0\0\0\0\0", 8);
// Char literals
L'\0';
L'a';
L'\xffffffff'>>31;
L'あ';
u'\0';
u'\xffff'>>15;
U'\xffffffff'>>31;
U'🍣';
// Identifier
var variablename123 = 1234;
// Punctuation
if (variablename123 == true) {
}
"#;
let expected_out = r#"Token(Punct, *)
Token(Punct, /)
Token(PPNum, 12345)
Token(Punct, ;)
Token(PPNum, 123.45)
Token(Punct, ;)
Token(PPNum, 10e+7)
Token(Punct, ;)
Token(PPNum, 0xFF)
Token(Punct, ;)
Token(PPNum, 0b11110000)
Token(Punct, ;)
Token(Ident, strcmp)
Token(Punct, ()
Token(Str, "αβγ")
Token(Punct, ,)
Token(Str, "\u03B1\u03B2\u03B3")
Token(Punct, ))
Token(Punct, ;)
Token(Ident, strcmp)
Token(Punct, ()
Token(Str, "日本語")
Token(Punct, ,)
Token(Str, "\u65E5\u672C\u8A9E")
Token(Punct, ))
Token(Punct, ;)
Token(Str, "u8\"a\"")
Token(Punct, ;)
Token(Ident, strcmp)
Token(Punct, ()
Token(Str, u8"abc")
Token(Punct, ,)
Token(Str, "abc")
Token(Punct, ))
Token(Punct, ;)
Token(Punct, ()
Token(Punct, {)
Token(Ident, char16_t)
Token(Ident, x)
Token(Punct, [)
Token(Punct, ])
Token(Punct, =)
Token(Str, u"αβ")
Token(Punct, ;)
Token(Ident, x)
Token(Punct, [)
Token(PPNum, 1)
Token(Punct, ])
Token(Punct, ;)
Token(Punct, })
Token(Punct, ))
Token(Punct, ))
Token(Ident, memcmp)
Token(Punct, ()
Token(Str, L"🍣")
Token(Punct, ,)
Token(Str, "c\363\001\0\0\0\0\0")
Token(Punct, ,)
Token(PPNum, 8)
Token(Punct, ))
Token(Punct, ;)
Token(Ident, sizeof)
Token(Punct, ()
Token(Str, L"\xffzzz")
Token(Punct, ))
Token(Punct, ;)
Token(Ident, memcmp)
Token(Punct, ()
Token(Str, U"abc")
Token(Punct, ,)
Token(Str, "a\0\0\0b\0\0\0c\0\0\0\0\0\0\0")
Token(Punct, ,)
Token(PPNum, 16)
Token(Punct, ))
Token(Punct, ;)
Token(Ident, memcmp)
Token(Punct, ()
Token(Str, U"🍣")
Token(Punct, ,)
Token(Str, "c\363\001\0\0\0\0\0")
Token(Punct, ,)
Token(PPNum, 8)
Token(Punct, ))
Token(Punct, ;)
Token(Num, L'\0')
Token(Punct, ;)
Token(Num, L'a')
Token(Punct, ;)
Token(Num, L'\xffffffff')
Token(Punct, >>)
Token(PPNum, 31)
Token(Punct, ;)
Token(Num, L'あ')
Token(Punct, ;)
Token(Num, u'\0')
Token(Punct, ;)
Token(Num, u'\xffff')
Token(Punct, >>)
Token(PPNum, 15)
Token(Punct, ;)
Token(Num, U'\xffffffff')
Token(Punct, >>)
Token(PPNum, 31)
Token(Punct, ;)
Token(Num, U'🍣')
Token(Punct, ;)
Token(Ident, var)
Token(Ident, variablename123)
Token(Punct, =)
Token(PPNum, 1234)
Token(Punct, ;)
Token(Ident, if)
Token(Punct, ()
Token(Ident, variablename123)
Token(Punct, ==)
Token(Ident, true)
Token(Punct, ))
Token(Punct, {)
Token(Punct, })
Token(EOF, )
"#;
let f = Rc::new(RefCell::new(File {
name: "test".into(),
file_no: 0,
contents: Rc::new(Vec::from(contents.as_bytes())),
display_name: "test".into(),
line_delta: 0,
}));
let mut t = Tokenize {
current_file: Some(f.clone()),
input_files: Vec::from([f.clone()]),
};
let v = t.tokenize(f);
let mut actual_out = String::new();
for t in v.tokens {
            actual_out.write_str(t.to_string().as_str()).unwrap();
            actual_out.write_char('\n').unwrap();
}
println!("{}", actual_out);
{
let expected_lines: Vec<&str> = expected_out.split("\n").collect();
let actual_lines: Vec<&str> = actual_out.split("\n").collect();
let mut fail = false;
for (i, line) in expected_lines.iter().enumerate() {
if i >= actual_lines.len() {
break;
}
if *line != actual_lines[i] {
fail = true;
println!(
"line not equal: i={}\n\texpected={}\n\tactual={}",
i, line, actual_lines[i]
);
}
}
if expected_lines.len() != actual_lines.len() {
fail = true;
println!(
"len(expected_lines) != len(actual_lines): {} {}",
expected_lines.len(),
actual_lines.len()
);
}
assert!(!fail);
}
}
}
//
// preprocess.c
//
// The preprocessor takes a list of tokens as an input and returns a
// new list of tokens as an output.
//
// The preprocessing language is designed in such a way that macro
// expansion is guaranteed to stop even if a macro is defined recursively.
// Informally speaking, a macro is applied only once for each token.
// That is, if a macro token T appears in a result of direct or
// indirect macro expansion of T, T won't be expanded any further.
// For example, if T is defined as U, and U is defined as T, then
// token T is expanded to U and then to T, and the macro expansion
// stops at that point.
//
// To achieve the above behavior, we attach to each token a set of
// macro names from which the token was expanded. The set is called
// a "hideset". A hideset is initially empty, and every time we expand a
// macro, the macro name is added to the resulting tokens' hidesets.
//
// The above macro expansion algorithm is explained in this document
// written by Dave Prossor, which is used as a basis for the standard's
// wording.
// https://github.com/rui314/chibicc/wiki/cpp.algo.pdf
#[derive(Debug)]
struct MacroParam {
name: String,
}
#[derive(Clone, Debug)]
struct MacroArg {
name: String,
is_va_args: bool,
tok: Rc<Vec<Token>>,
}
type MacroHandlerFn = fn(&Token, &mut Preprocess) -> Token;
#[derive(Debug)]
struct Macro {
name: String,
is_objlike: bool, // Object-like or function-like
params: Vec<MacroParam>,
va_args_name: Option<String>,
body: Option<Rc<TokenList>>,
handler: Option<MacroHandlerFn>,
}
// `#if` can be nested, so we use a stack to manage nested `#if`s.
#[derive(Debug, PartialEq)]
enum CondInclCtx {
InThen,
InElif,
InElse,
}
#[derive(Debug)]
struct CondIncl {
ctx: CondInclCtx,
tok: Rc<Token>,
included: bool,
}
pub struct Preprocess {
macros: HashMap<String, Rc<Macro>>,
cond_incl: Vec<Rc<RefCell<CondIncl>>>,
pragma_once: HashSet<String>,
include_paths: Vec<String>,
include_next_idx: usize,
include_cache: HashMap<String, String>,
include_guards: HashMap<String, String>,
counter_macro_idx: i32,
base_file: String,
}
impl std::fmt::Debug for Preprocess {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("Preprocess")
.field("macros.key()", &self.macros.keys())
.field(
"cond_incl",
&self
.cond_incl
.iter()
.map(|c| format!("[{}, {:?}]", c.borrow().tok, c.borrow().ctx))
.collect::<Vec<String>>(),
)
.field("pragma_once", &self.pragma_once)
.field("include_paths", &self.include_paths)
.field("include_next_idx", &self.include_next_idx)
.field("include_cache", &self.include_cache)
.field("include_guards", &self.include_guards)
.field("counter_macro_idx", &self.counter_macro_idx)
.field("base_file", &self.base_file)
.finish()
}
}
impl Preprocess {
pub fn new(include_paths: Vec<String>) -> Preprocess {
return Preprocess {
macros: HashMap::new(),
cond_incl: Vec::new(),
pragma_once: HashSet::new(),
            include_paths,
include_next_idx: 0,
include_cache: HashMap::new(),
include_guards: HashMap::new(),
counter_macro_idx: 0,
base_file: String::new(),
};
}
}
impl Token {
#[instrument(ret, level=Level::DEBUG)]
fn equal(&self, op: &str) -> bool {
self.loc.as_bytes() == op.as_bytes()
}
#[instrument(ret, level=Level::DEBUG)]
fn is_hash(&self) -> bool {
return self.at_bol && self.equal("#");
}
}
#[derive(Clone)] // TODO: reduce clone() calls
pub struct TokenList {
tokens: Vec<Token>,
idx: usize,
}
impl std::fmt::Debug for TokenList {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("TokenIterator")
.field("tokens[idx]", &self.tokens.get(self.idx))
.field("idx", &self.idx)
.field("len()", &self.len())
.finish()
}
}
impl Index<Range<usize>> for TokenList {
type Output = [Token];
fn index(&self, index: Range<usize>) -> &Self::Output {
let r: Range<usize> = Range {
start: index.start + self.idx,
end: index.end + self.idx,
};
return &self.tokens[r];
}
}
impl Index<RangeFull> for TokenList {
type Output = [Token];
fn index(&self, _: RangeFull) -> &Self::Output {
let len = self.tokens.len() - self.idx;
return &self[0..len];
}
}
impl Index<usize> for TokenList {
type Output = Token;
fn index(&self, index: usize) -> &Self::Output {
return &self.tokens[self.idx + index];
}
}
impl TokenList {
fn len(&self) -> usize {
return self.tokens.len() - self.idx;
}
}
impl std::ops::IndexMut<usize> for TokenList {
fn index_mut(&mut self, index: usize) -> &mut Self::Output {
return &mut self.tokens[self.idx + index];
}
}
impl TokenList {
#[instrument(ret)]
fn skip(&mut self, op: &str) {
let tok = &self[0];
if !tok.equal(op) {
tok.error(&format!("expected '{:?}'", op));
}
self.idx += 1;
}
#[instrument(ret)]
fn consume(&mut self, op: &str) -> bool {
let tok = &self[0];
if tok.equal(op) {
self.idx += 1;
return true;
}
return false;
}
// Some preprocessor directives such as #include allow extraneous
// tokens before newline. This function skips such tokens.
#[instrument(ret)]
fn skip_line(&mut self) {
if self.len() == 0 {
return;
}
let tok = &self[0];
if tok.at_bol {
return;
}
tok.warn("extra token"); // TODO: issues with nested macros?
while !self[0].at_bol {
self.idx += 1;
}
}
}
impl Hideset {
#[instrument(ret, level = Level::DEBUG)]
fn new(name: String) -> Self {
return Hideset { name };
}
}
#[instrument(ret, level=Level::DEBUG)]
fn hideset_union(hs1: &Vec<Hideset>, hs2: &Vec<Hideset>) -> Vec<Hideset> {
let mut r = hs1.clone();
r.extend_from_slice(&hs2);
return r;
}
#[instrument(ret, level=Level::DEBUG)]
fn hideset_contains(hs: &Vec<Hideset>, s: &str) -> bool {
for h in hs.iter() {
if h.name == s {
return true;
}
}
return false;
}
#[instrument(ret)]
fn hideset_intersection(hs1: &Vec<Hideset>, hs2: &Vec<Hideset>) -> Vec<Hideset> {
let mut r = Vec::new();
for x in hs1.iter() {
if hideset_contains(hs2, &x.name) {
r.push(Hideset::new(x.name.clone()))
}
}
return r;
}
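The hideset machinery is what makes the `T -> U -> T` example from the comment above terminate. A toy model with plain string sets shows the effect (this `expand` is purely illustrative, not the compiler's expansion routine, and it only handles object-like macros):

```rust
use std::collections::HashSet;

// Toy model of hideset-based expansion: a macro name is expanded only
// if it is not already in the token's hideset; each expansion adds the
// name to the hideset, so mutually recursive definitions terminate.
fn expand(name: &str, defs: &[(&str, &str)], hideset: &mut HashSet<String>) -> String {
    if hideset.contains(name) {
        return name.to_string(); // already expanded once; stop here
    }
    match defs.iter().find(|(n, _)| *n == name) {
        Some((_, body)) => {
            hideset.insert(name.to_string());
            expand(body, defs, hideset)
        }
        None => name.to_string(),
    }
}
```

With `T` defined as `U` and `U` defined as `T`, expanding `T` goes to `U`, back to `T`, and stops there because `T` is now hidden.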
impl TokenList {
#[instrument(ret)]
fn add_hideset(&mut self, hs: &Vec<Hideset>) {
for i in self.idx..self.tokens.len() {
self.tokens[i].hideset = hideset_union(&self.tokens[i].hideset, hs);
}
}
#[instrument(ret)]
fn append(&mut self, tok: &TokenList) {
self.tokens.extend_from_slice(&tok.tokens[tok.idx..]);
}
#[instrument(ret)]
fn skip_cond_incl2(&mut self) {
while self.len() > 0 {
if self[0].is_hash()
&& (self[1].equal("if") || self[1].equal("ifdef") || self[1].equal("ifndef"))
{
self.idx += 2;
self.skip_cond_incl2();
continue;
}
if self[0].is_hash() && self[1].equal("endif") {
self.idx += 2;
return;
}
self.idx += 1;
}
}
// Skip until next `#else`, `#elif`, or `#endif`.
// Nested `#if` and `#endif` are skipped.
#[instrument(ret)]
fn skip_cond_incl(&mut self) {
while self.len() > 0 {
if self[0].is_hash()
&& (self[1].equal("if") || self[1].equal("ifdef") || self[1].equal("ifndef"))
{
self.idx += 2;
self.skip_cond_incl2();
continue;
}
if self[0].is_hash()
&& (self[1].equal("elif") || self[1].equal("else") || self[1].equal("endif"))
{
break;
}
self.idx += 1;
}
}
}
#[instrument(ret)]
fn quote_string(s: &str) -> String {
    // Two quote bytes plus one escape byte per '\\' or '"'.
    let mut buf: Vec<u8> = Vec::with_capacity(s.len() + 2);
    buf.push(b'"');
    for b in s.bytes() {
        if b == b'\\' || b == b'"' {
            buf.push(b'\\');
        }
        buf.push(b);
    }
    buf.push(b'"');
    // Pushing bytes via `as char` would garble non-ASCII UTF-8 input,
    // so build the raw bytes and convert once; only ASCII was inserted,
    // so the result is still valid UTF-8.
    return String::from_utf8(buf).unwrap();
}
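A quick check of the stringizing logic used by `new_str_token`: backslashes and double quotes get escaped, everything else passes through unchanged. Standalone copy for testing (named `quote` here to avoid clashing with the function above):

```rust
// Wrap s in double quotes, escaping '\\' and '"' (standalone sketch of
// quote_string for testing).
fn quote(s: &str) -> String {
    let mut buf = Vec::with_capacity(s.len() + 2);
    buf.push(b'"');
    for b in s.bytes() {
        if b == b'\\' || b == b'"' {
            buf.push(b'\\');
        }
        buf.push(b);
    }
    buf.push(b'"');
    String::from_utf8(buf).unwrap()
}
```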
impl Token {
#[instrument(ret)]
fn new_str_token(s: &str, tmpl: &Token) -> Token {
let buf = quote_string(s);
let mut t = Tokenize::new();
let f = File::new(
tmpl.file.borrow().name.clone(),
tmpl.file.borrow().file_no,
buf.into_bytes(),
);
let tokens: TokenList = t.tokenize(Rc::new(RefCell::new(f)));
assert!(tokens.tokens.len() == 1);
return tokens[0].clone();
}
}
impl TokenList {
#[instrument(ret)]
fn copy_line(&mut self) -> TokenList {
let mut r = Vec::new();
while !self[0].at_bol {
r.push(self[0].clone());
self.idx += 1;
}
return TokenList { tokens: r, idx: 0 };
}
}
impl Token {
#[instrument(ret)]
fn new_num_token(val: i32, tmpl: &Token) -> Token {
let buf = format!("{}\n", val);
let mut t = Tokenize::new();
let f = File::new(
tmpl.file.borrow().name.clone(),
tmpl.file.borrow().file_no,
buf.into_bytes(),
);
let tokens: TokenList = t.tokenize(Rc::new(RefCell::new(f)));
assert!(tokens.tokens.len() == 1);
return tokens[0].clone();
}
}
impl TokenList {
#[instrument(ret)]
fn read_const_expr(&mut self, pp: &Preprocess) -> TokenList {
let mut tok = self.copy_line();
let mut ret = Vec::new();
while tok.len() > 0 {
// "defined(foo)" or "defined foo" becomes "1" if macro "foo"
// is defined. Otherwise "0".
if tok[0].equal("defined") {
let start = tok[0].clone();
tok.idx += 1;
let has_paren = tok.consume("(");
if tok[0].kind != TokenKind::Ident {
start.error("macro name must be an identifier");
}
let m = pp.find_macro(&tok[0]);
tok.idx += 1;
if has_paren {
tok.skip(")");
}
ret.push(Token::new_num_token(
if m.is_some() { 1 } else { 0 },
&start,
));
continue;
}
ret.push(tok[0].clone());
tok.idx += 1;
}
return TokenList {
tokens: ret,
idx: 0,
};
}
#[instrument(ret)]
fn eval_const_expr(&mut self, pp: &mut Preprocess, tokenize: &mut Tokenize) -> i64 {
let start = self[0].clone();
self.idx += 1;
let mut expr = self.read_const_expr(pp);
expr = preprocess2(expr, pp, tokenize);
if expr.len() == 0 {
start.error("no expression");
}
// [https://www.sigbus.info/n1570#6.10.1p4] The standard requires
// we replace remaining non-macro identifiers with "0" before
// evaluating a constant expression. For example, `#if foo` is
// equivalent to `#if 0` if foo is not defined.
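// (Illustrative: with FOO undefined, `#if FOO + 1` is evaluated as
// `#if 0 + 1`, i.e. the branch is taken.)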
let expr_orig = expr.join(None);
{
let mut i: usize = 0;
while i < expr.len() {
if expr[i].kind == TokenKind::Ident {
expr[i] = Token::new_num_token(0, &expr[i]);
}
i += 1;
}
}
// Convert pp-numbers to regular numbers
Tokenize::convert_pp_tokens(&mut expr.tokens);
eprintln!(
"WARNING: if evaluation not implemented yet, returning 0: {} ||| {}",
expr_orig,
expr.join(None)
);
return 0;
// TODO: implement after implementing parse
// expr.const_expr
/*
Token *rest2;
long val = const_expr(&rest2, expr);
if (rest2->kind != TK_EOF)
error_tok(rest2, "extra token");
return val;
*/
todo!();
}
}
impl Preprocess {
#[instrument(ret)]
fn push_cond_incl(&mut self, tok: Rc<Token>, included: bool) -> Rc<RefCell<CondIncl>> {
let ci = CondIncl {
ctx: CondInclCtx::InThen,
tok: tok.clone(),
included,
};
let circ = Rc::new(RefCell::new(ci));
self.cond_incl.push(circ.clone());
return circ;
}
#[instrument(ret, level=Level::DEBUG)]
fn find_macro(&self, tok: &Token) -> Option<Rc<Macro>> {
if tok.kind != TokenKind::Ident {
return None;
}
return self.macros.get(tok.loc.as_str()).map(|x| x.clone());
}
#[instrument(ret, level=Level::DEBUG)]
fn add_macro(&mut self, m: Macro) -> Rc<Macro> {
let name = m.name.clone();
let mrc = Rc::new(m);
self.macros.insert(name, mrc.clone());
return mrc;
}
}
impl TokenList {
#[instrument(ret)]
fn read_macro_params(&mut self) -> (Vec<MacroParam>, Option<String>) {
let mut va_args_name = None;
let mut params = Vec::new();
while !self[0].equal(")") {
if params.len() != 0 {
self.skip(",");
}
if self[0].equal("...") {
va_args_name = Some("__VA_ARGS__".into());
self.idx += 1;
self.skip(")");
return (params, va_args_name);
}
if self[0].kind != TokenKind::Ident {
self[0].error("expected an identifier");
}
if self[1].equal("...") {
va_args_name = Some(String::from(self[0].loc.as_str()));
self.idx += 2;
self.skip(")");
return (params, va_args_name);
}
let m = MacroParam {
name: self[0].loc.as_str().into(),
};
params.push(m);
self.idx += 1;
}
self.idx += 1;
return (params, va_args_name);
}
#[instrument(ret)]
fn read_macro_definition(&mut self, pp: &mut Preprocess) {
if self[0].kind != TokenKind::Ident {
self[0].error("macro name must be an identifier");
}
let name = String::from(self[0].loc.as_str());
self.idx += 1;
if !self[0].has_space && self[0].equal("(") {
// Function-like macro
self.idx += 1;
let (params, va_args_name) = self.read_macro_params();
let tokens = self.copy_line();
let m = Macro {
name,
is_objlike: false,
params,
va_args_name,
body: Some(Rc::new(tokens)),
handler: None,
};
pp.add_macro(m);
} else {
let tokens = self.copy_line();
let m = Macro {
name,
is_objlike: true,
params: Vec::new(),
va_args_name: None,
body: Some(Rc::new(tokens)),
handler: None,
};
pp.add_macro(m);
}
}
#[instrument(ret)]
fn read_macro_arg_one(&mut self, read_rest: bool) -> MacroArg {
let mut tokens = Vec::new();
let mut level = 0;
loop {
// Check for end of input first so that the `self[0]` accesses
// below cannot index past the end of the token list.
if self.len() == 0 {
match tokens.last() {
Some(t) => t.error("premature end of input"),
None => panic!("premature end of input"),
}
}
if level == 0 && self[0].equal(")") {
break;
}
if level == 0 && !read_rest && self[0].equal(",") {
break;
}
if self[0].equal("(") {
level += 1;
} else if self[0].equal(")") {
level -= 1;
}
tokens.push(self[0].clone());
self.idx += 1;
}
let arg = MacroArg {
name: String::new(),
is_va_args: false,
tok: Rc::new(tokens),
};
return arg;
}
#[instrument(ret)]
fn read_macro_args(
&mut self,
params: &Vec<MacroParam>,
va_args_name: Option<String>,
) -> Vec<MacroArg> {
self.idx += 2;
let mut args = Vec::new();
for p in params {
if args.len() != 0 {
self.skip(",");
}
let mut a = self.read_macro_arg_one(false);
a.name = p.name.clone();
args.push(a);
}
if let Some(va_args_name) = va_args_name {
let mut arg = if self[0].equal(")") {
MacroArg {
name: String::new(),
is_va_args: true,
tok: Rc::new(vec![]),
}
} else {
if args.len() > 0 {
self.skip(",");
}
self.read_macro_arg_one(true)
};
arg.name = va_args_name;
arg.is_va_args = true;
args.push(arg);
}
self.skip(")");
// Step back so the caller can still read the ")" token; expand_macro
// needs it for the hideset intersection.
self.idx -= 1;
return args;
}
}
#[instrument(ret, level=Level::DEBUG)]
fn find_arg(args: &Vec<MacroArg>, tok: &Token) -> Option<MacroArg> {
for a in args {
if a.name == tok.loc.as_str() {
return Some(a.clone());
}
}
return None;
}
impl TokenList {
// Concatenates all tokens in `self` and returns a new string
#[instrument(ret)]
fn join(&self, end: Option<&Token>) -> String {
let mut len = 0;
for t in &self[..] {
if end.is_some() && t.loc.idx == end.unwrap().loc.idx {
break;
}
if len == 0 || t.has_space {
len += 1;
}
len += t.loc.len();
}
let mut buf = String::with_capacity(len);
// Copy token texts.
for t in &self[..] {
if end.is_some() && t.loc.idx == end.unwrap().loc.idx {
break;
}
if buf.len() > 0 && t.has_space {
buf.push(' ');
}
buf.push_str(t.loc.as_str());
}
return buf;
}
// Concatenate all tokens in `self` and return a new string token.
// This function is used for the stringizing operator (#).
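// Illustrative example: given `#define STR(x) #x`, the invocation
// STR(a + b) stringizes the argument tokens to the string literal
// "a + b" (with single spaces wherever the tokens had whitespace).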
#[instrument(ret)]
fn stringize(&self, hash: &Token) -> Token {
// Create a new string token. We need to set some value for
// its source location for error reporting function, so we use
// a macro name token as a template.
let s = self.join(None);
return Token::new_str_token(&s, hash);
}
}
impl Token {
#[instrument(ret)]
fn paste(&self, other: &Token) -> Token {
let buf = format!("{}{}", self.loc.as_str(), other.loc.as_str());
let mut t = Tokenize::new();
let f = File::new(
self.file.borrow().name.clone(),
self.file.borrow().file_no,
buf.clone().into_bytes(),
);
let tokens = t.tokenize(Rc::new(RefCell::new(f)));
if tokens.tokens.len() != 1 {
self.error(format!("pasting forms '{}', an invalid token", buf).as_str());
}
return tokens[0].clone();
}
}
#[instrument(ret)]
fn has_varargs(args: &Vec<MacroArg>) -> bool {
for ap in args {
if ap.name == "__VA_ARGS__" {
return ap.tok.len() > 0;
}
}
return false;
}
// Replace func-like macro params with given arguments.
impl TokenList {
#[instrument(ret)]
fn subst(
mut self,
args: &Vec<MacroArg>,
pp: &mut Preprocess,
tokenize: &mut Tokenize,
) -> TokenList {
let mut v = Vec::new();
while self.len() > 0 {
let tok = &self[0];
if tok.equal("#") {
let arg = find_arg(args, &self[1]);
if arg.is_none() {
self[1].error("'#' is not followed by a macro parameter");
}
let argtl: TokenList = TokenList {
tokens: arg.unwrap().tok.to_vec(),
idx: 0,
};
v.push(argtl.stringize(&self[0]));
self.idx += 2;
continue;
}
// [GNU] If __VA_ARGS__ is empty, `,##__VA_ARGS__` is expanded
// to the empty token list. Otherwise, it's expanded to `,` and
// __VA_ARGS__.
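// (Illustrative: with `#define f(...) g(a, ## __VA_ARGS__)`, `f()`
// expands to `g(a)` while `f(1, 2)` expands to `g(a, 1, 2)`.)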
if tok.equal(",") && self[1].equal("##") {
let arg = find_arg(args, &self[2]);
if arg.is_some() {
if let Some(arg) = arg {
if arg.tok.len() == 0 {
self.idx += 3;
} else {
v.push(tok.clone());
self.idx += 2;
}
continue;
}
}
}
if tok.equal("##") {
if v.len() == 0 {
tok.error("'##' cannot appear at the start of macro expansion");
}
if self.len() == 1 {
tok.error("'##' cannot appear at the end of macro expansion");
}
let arg = find_arg(args, &self[1]);
if let Some(a) = arg {
if a.tok.len() > 0 {
let last = v.last_mut().unwrap();
*last = last.paste(&a.tok[0]);
for t in &a.tok[1..] {
v.push(t.clone());
}
}
self.idx += 2;
continue;
}
let last = v.last_mut().unwrap();
*last = last.paste(&self[1]);
self.idx += 2;
continue;
}
let arg = find_arg(args, &tok);
if arg.is_some() && self.len() > 1 && self[1].equal("##") {
let rhs = &self[2];
let arg = arg.unwrap();
if arg.tok.len() == 0 {
let arg2 = find_arg(args, rhs);
if let Some(arg2) = arg2 {
for t in arg2.tok.iter() {
v.push(t.clone());
}
} else {
v.push(rhs.clone());
}
self.idx += 3;
continue;
}
for t in arg.tok.iter() {
v.push(t.clone());
}
self.idx += 1;
continue;
}
// If __VA_ARGS__ is empty, __VA_OPT__(x) is expanded to the
// empty token list. Otherwise, __VA_OPT__(x) is expanded to x.
if tok.equal("__VA_OPT__") && self[1].equal("(") {
self.idx += 2;
let arg = self.read_macro_arg_one(true);
if has_varargs(args) {
for t in arg.tok.iter() {
v.push(t.clone());
}
}
self.skip(")");
continue;
}
// Handle a macro token. Macro arguments are completely macro-expanded
// before they are substituted into a macro body.
if let Some(a) = arg {
let mut t: TokenList = preprocess2(
TokenList {
tokens: a.tok.to_vec(),
idx: 0,
},
pp,
tokenize,
);
if t.len() > 0 {
t[0].at_bol = tok.at_bol;
t[0].has_space = tok.has_space;
for tok in t.tokens {
v.push(tok);
}
}
self.idx += 1;
continue;
}
// Handle a non-macro token
v.push(tok.clone());
self.idx += 1;
continue;
}
return TokenList { tokens: v, idx: 0 };
}
// If tok is a macro, expand it and return true.
// Otherwise, do nothing and return false.
#[instrument(ret)]
fn expand_macro(mut self, pp: &mut Preprocess, tokenize: &mut Tokenize) -> (TokenList, bool) {
if hideset_contains(&self[0].hideset, self[0].loc.as_str()) {
return (self, false);
}
let m = pp.find_macro(&self[0]);
if m.is_none() {
return (self, false);
}
let m = m.unwrap();
// Built-in dynamic macro application such as __LINE__
if let Some(h) = m.handler {
let t: Token = h(&self[0], pp);
let mut tokens = TokenList {
tokens: vec![t],
idx: 0,
};
self.idx += 1;
tokens.append(&self);
return (tokens, true);
}
// Object-like macro application
if m.is_objlike {
let hs = hideset_union(&self[0].hideset, &vec![Hideset::new(m.name.clone())]);
let start_tok: Token = self[0].clone();
let mut body = m.body.as_ref().unwrap().as_ref().clone();
// Add the macro's own name to the hideset of every token in the
// expanded body so the macro cannot recursively re-expand itself.
body.add_hideset(&hs);
for t in body.tokens.iter_mut() {
t.origin = Some(Rc::new(start_tok.clone()))
}
self.idx += 1;
body.append(&self);
body[0].at_bol = start_tok.at_bol;
body[0].has_space = start_tok.has_space;
return (body, true);
}
// If a function-like macro token is not followed by an argument
// list, treat it as a normal identifier.
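// (Illustrative: given `#define f(x) x`, the input `f + 1` leaves the
// identifier `f` untouched because no "(" follows it.)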
if self.len() == 1 || !self[1].equal("(") {
return (self, false);
}
// Function-like macro application
let macro_token = Rc::new(self[0].clone());
let args = self.read_macro_args(&m.params, m.va_args_name.clone());
let r_paren = self[0].clone();
// Tokens that make up a function-like macro invocation may have
// different hidesets, and if that's the case, it's not clear what
// the hideset for the new tokens should be. We take the intersection
// of the hidesets of the macro token and the closing parenthesis and
// use it as the new hideset, as explained in Dave Prosser's algorithm.
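// (Illustrative: with the self-referential `#define f(x) f(x)`, the
// `f` token produced by expanding `f(1)` carries `f` in its hideset,
// so expand_macro refuses to expand it again and expansion terminates.)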
let mut hs = hideset_intersection(¯o_token.hideset, &r_paren.hideset);
hs = hideset_union(&hs, &vec![Hideset::new(m.name.clone())]);
let mut body = m.body.clone().unwrap().as_ref().clone();
body = body.subst(&args, pp, tokenize);
body.add_hideset(&hs);
for t in body.tokens.iter_mut() {
t.origin = Some(macro_token.clone());
}
body.append(&self);
body[0].at_bol = macro_token.at_bol;
body[0].has_space = macro_token.has_space;
return (body, true);
}
}
impl Preprocess {
#[instrument(ret)]
fn search_include_paths(&mut self, filename: &str) -> Option<String> {
if filename.starts_with("/") {
return Some(String::from(filename));
}
let cached = self.include_cache.get(filename);
if let Some(c) = cached {
return Some(c.clone());
}
// Search a file from the include paths.
for (i, p) in self.include_paths.iter().enumerate() {
let path = format!("{}/{}", p, filename);
if !Path::new(&path).exists() {
continue;
}
self.include_cache
.insert(String::from(filename), path.clone());
self.include_next_idx = i + 1;
return Some(path);
}
return None;
}
#[instrument(ret)]
fn search_include_next(&mut self, filename: &str) -> Option<String> {
for p in &self.include_paths[self.include_next_idx..] {
let path = format!("{}/{}", p, filename);
if Path::new(&path).exists() {
return Some(path);
}
self.include_next_idx += 1;
}
return None;
}
}
impl TokenList {
// Read an # include argument
#[instrument(ret)]
fn read_include_filename(
&mut self,
pp: &mut Preprocess,
tokenize: &mut Tokenize,
) -> (Vec<u8>, bool) {
// Pattern 1: #include "foo.h"
if self[0].kind == TokenKind::Str {
// A double-quoted filename for #include is a special kind of
// token, and we don't want to interpret any escape sequences in it.
// For example, "\f" in "C:\foo" is not a formfeed character but
// just two non-control characters, backslash and f.
// So, we don't want to use token->str.
let s = Vec::from(&self[0].loc.contents[self[0].loc.idx + 1..self[0].loc.end_idx - 1]);
self.idx += 1;
self.skip_line();
return (s, true);
}
// Pattern 2: #include <foo.h>
if self[0].equal("<") {
// Reconstruct a filename from a sequence of tokens between
// "<" and ">".
let mut i = 0;
while i < self.len() && !self[i].equal(">") {
if self[i].at_bol {
self[i].error("expected '>'");
}
i += 1;
}
if i == self.len() {
self[i - 1].error("expected '>'");
}
let end = self[i].clone();
self.idx += 1;
let s = self.join(Some(&end));
self.idx += i;
self.skip_line();
return (s.into_bytes(), false);
}
// Pattern 3: #include FOO
// In this case FOO must be macro-expanded to either
// a single string token or a sequence of "<" ... ">".
if self[0].kind == TokenKind::Ident {
let cl = self.copy_line();
let mut tok2 = preprocess2(cl, pp, tokenize);
return tok2.read_include_filename(pp, tokenize);
}
self[0].error("expected a filename");
panic!("unreachable");
}
#[instrument(ret)]
fn detect_include_guard(&self) -> Option<Vec<u8>> {
let mut t = self.clone(); // TODO: expensive
// Detect the first two lines.
if !t[0].is_hash() || !t[1].equal("ifndef") {
return None;
}
t.idx += 2;
if t[0].kind != TokenKind::Ident {
return None;
}
let m = String::from(t[0].loc.as_str());
t.idx += 1;
if !t[0].is_hash() || !t[1].equal("define") || !t[2].equal(&m) {
return None;
}
while t.len() > 0 {
if !t[0].is_hash() {
t.idx += 1;
continue;
}
if t.len() == 2 && t[1].equal("endif") {
return Some(m.into_bytes());
}
if t[0].equal("if") || t[0].equal("ifdef") || t[0].equal("ifndef") {
t.idx += 1;
t.skip_cond_incl();
} else {
t.idx += 1;
}
}
return None;
}
#[instrument(ret)]
fn include_file(
self,
path: &str,
filename_tok: &Token,
pp: &mut Preprocess,
t: &mut Tokenize,
) -> TokenList {
// Check for "#pragma once"
if pp.pragma_once.contains(path) {
return self;
}
// If we read the same file before, and if the file was guarded
// by the usual #ifndef ... #endif pattern, we may be able to
// skip the file without opening it.
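// (Illustrative guard pattern:
//     #ifndef FOO_H
//     #define FOO_H
//     ...
//     #endif
// Once FOO_H is defined, re-reading the file would contribute no
// tokens, so the file can be skipped without opening it.)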
let guard_name = pp.include_guards.get(path);
if guard_name.is_some() && pp.macros.get(guard_name.unwrap()).is_some() {
return self;
}
let tok2 = t.tokenize_file(path);
if tok2.is_none() {
// TODO: include errno in the error message?
filename_tok.error(format!("{}: cannot open file", path).as_str())
}
let mut tok2 = tok2.unwrap();
// Detect the guard on the *included* file's tokens, not on the
// remainder of the current file.
if let Some(guard_name) = tok2.detect_include_guard() {
pp.include_guards
.insert(String::from(path), String::from_utf8(guard_name).unwrap());
}
tok2.append(&self);
return tok2;
}
#[instrument(ret)]
fn read_line_marker(&mut self, pp: &mut Preprocess, tokenize: &mut Tokenize) {
let start_idx = self.idx;
let cl: TokenList = self.copy_line();
let mut tok = pp.preprocess(cl, tokenize);
if let TokenFields::Int(ty, val) = &tok[0].other {
assert_eq!(tok[0].kind, TokenKind::Num);
assert!(matches!(ty.kind, TypeKind::Int));
let mut f = self.tokens[start_idx].file.borrow_mut();
f.line_delta = *val as i32 - self.tokens[start_idx].line_no;
tok.idx += 1;
if tok.len() == 0 {
return;
}
if let TokenFields::Str(_, s) = &tok[0].other {
assert_eq!(tok[0].kind, TokenKind::Str);
f.display_name = String::from_utf8(Vec::from(s.as_ref())).unwrap();
} else {
tok[0].error("filename expected");
}
} else {
tok[0].error("invalid line marker");
}
}
}
#[instrument(ret)]
fn preprocess2(mut t: TokenList, pp: &mut Preprocess, tokenize: &mut Tokenize) -> TokenList {
let mut ret = Vec::new();
while t.len() > 0 {
// If it is a macro, expand it.
let ok: bool;
(t, ok) = t.expand_macro(pp, tokenize);
if ok {
continue;
}
// Pass through if it is not a "#".
if !t[0].is_hash() {
let line_delta = t[0].file.borrow().line_delta;
t[0].line_delta = line_delta;
let display_name = t[0].file.borrow().display_name.clone();
t[0].filename = Some(display_name);
ret.push(t[0].clone()); // TODO: avoid token clones?
t.idx += 1;
continue;
}
let start: Token = t[0].clone();
// Token used for error reporting by #include; fall back to `start`
// when the line is shorter than three tokens.
let start_next_next = if t.len() > 2 { t[2].clone() } else { start.clone() };
t.idx += 1;
if t.len() == 0 {
// A lone `#` at the end of the input is a null directive.
continue;
}
if t[0].equal("include") {
t.idx += 1;
let (filename, is_dquote) = t.read_include_filename(pp, tokenize);
let filename = String::from_utf8(filename).unwrap();
if !filename.starts_with("/") && is_dquote {
let mut dirname = PathBuf::from(&start.file.borrow().name);
dirname.pop();
let path = format!("{}/{}", dirname.to_str().unwrap(), filename);
if Path::new(&path).exists() {
t = t.include_file(&path, &start_next_next, pp, tokenize);
continue;
}
}
let path = pp.search_include_paths(&filename);
t = t.include_file(&path.unwrap_or(filename), &start_next_next, pp, tokenize);
continue;
}
if t[0].equal("include_next") {
t.idx += 1;
let (filename, _) = t.read_include_filename(pp, tokenize);
let filename = String::from_utf8(filename).unwrap();
let path = pp.search_include_next(&filename);
t = t.include_file(&path.unwrap_or(filename), &start_next_next, pp, tokenize);
continue;
}
if t[0].equal("define") {
t.idx += 1;
t.read_macro_definition(pp);
continue;
}
if t[0].equal("undef") {
t.idx += 1;
if t[0].kind != TokenKind::Ident {
t[0].error("macro name must be an identifier");
}
pp.undef_macro(t[0].loc.as_str());
t.idx += 1;
t.skip_line();
continue;
}
if t[0].equal("if") {
let val = t.eval_const_expr(pp, tokenize);
pp.push_cond_incl(Rc::new(start), val != 0); // `start` is the "#" token, kept for error reporting
if val == 0 {
t.skip_cond_incl();
}
continue;
}
if t[0].equal("ifdef") {
let defined = pp.find_macro(&t[1]).is_some();
pp.push_cond_incl(Rc::new(t[0].clone()), defined);
t.idx += 2;
t.skip_line();
if !defined {
t.skip_cond_incl();
}
continue;
}
if t[0].equal("ifndef") {
let defined = pp.find_macro(&t[1]).is_some();
pp.push_cond_incl(Rc::new(t[0].clone()), !defined);
t.idx += 2;
t.skip_line();
if defined {
t.skip_cond_incl();
}
continue;
}
if t[0].equal("elif") {
if pp.cond_incl.len() == 0
|| pp.cond_incl.last().unwrap().borrow().ctx == CondInclCtx::InElse
{
start.error("stray #elif");
}
{
let mut ci = pp.cond_incl.last().unwrap().borrow_mut();
ci.ctx = CondInclCtx::InElif;
}
let included = pp.cond_incl.last().unwrap().borrow().included;
if !included && t.eval_const_expr(pp, tokenize) != 0 {
let mut ci = pp.cond_incl.last().unwrap().borrow_mut();
ci.included = true;
} else {
t.skip_cond_incl();
}
continue;
}
if t[0].equal("else") {
if pp.cond_incl.is_empty()
|| pp.cond_incl.last().unwrap().borrow().ctx == CondInclCtx::InElse
{
start.error("stray #else");
}
let mut ci = pp.cond_incl.last().unwrap().borrow_mut();
ci.ctx = CondInclCtx::InElse;
t.idx += 1;
t.skip_line();
if !ci.included {
t.skip_cond_incl();
}
continue;
}
if t[0].equal("endif") {
if pp.cond_incl.len() == 0 {
start.error("stray #endif");
}
pp.cond_incl.pop();
t.idx += 1;
t.skip_line();
continue;
}
if t[0].equal("line") {
t.idx += 1;
t.read_line_marker(pp, tokenize);
continue;
}
if t[0].kind == TokenKind::PPNum {
t.read_line_marker(pp, tokenize);
continue;
}
if t[0].equal("pragma") && t[1].equal("once") {
pp.pragma_once.insert(t[0].file.borrow().name.clone());
t.idx += 2;
t.skip_line();
continue;
}
if t[0].equal("pragma") {
loop {
t.idx += 1;
if t.len() == 0 || t[0].at_bol {
break;
}
}
continue;
}
if t[0].equal("error") {
t[0].error("error");
}
// CUSTOM
if t[0].equal("warning") {
eprintln!("CUSTOM: skipping warning");
t.skip_line();
continue;
}
// `#`-only line is legal. It's called a null directive.
if t[0].at_bol {
continue;
}
t[0].error("invalid preprocessor directive");
}
return TokenList {
tokens: ret,
idx: 0,
};
}
impl Preprocess {
#[instrument(ret)]
fn define_macro(&mut self, name: &str, buf: &str) {
let f = File::new("<built-in>".into(), 1, Vec::from(buf.as_bytes()));
let mut t = Tokenize::new();
let tokens = t.tokenize(Rc::new(RefCell::new(f)));
self.add_macro(Macro {
name: String::from(name),
is_objlike: true,
params: Vec::new(),
va_args_name: None,
body: Some(Rc::new(tokens)),
handler: None,
});
}
#[instrument(ret)]
fn undef_macro(&mut self, name: &str) {
self.macros.remove(name);
}
#[instrument(ret)]
fn add_builtin(&mut self, name: &str, f: MacroHandlerFn) {
let m = Macro {
name: String::from(name),
is_objlike: true,
params: Vec::new(),
va_args_name: None,
body: None,
handler: Some(f),
};
self.add_macro(m);
}
}
impl Token {
#[instrument(ret)]
fn file_macro(tmpl: &Token, _: &mut Preprocess) -> Token {
let mut tmpl = Rc::new(tmpl.clone());
while tmpl.origin.is_some() {
tmpl = tmpl.origin.clone().unwrap();
}
return Token::new_str_token(&tmpl.file.borrow().display_name, tmpl.as_ref());
}
#[instrument(ret)]
fn line_macro(tmpl: &Token, _: &mut Preprocess) -> Token {
let mut tmpl = Rc::new(tmpl.clone());
while let Some(t) = tmpl.origin.clone() {
tmpl = t
}
let i = tmpl.line_no + tmpl.file.borrow().line_delta;
return Token::new_num_token(i, tmpl.as_ref());
}
// __COUNTER__ is expanded to serial values starting from 0
#[instrument(ret)]
fn counter_macro(tmpl: &Token, pp: &mut Preprocess) -> Token {
let t = Token::new_num_token(pp.counter_macro_idx, tmpl);
pp.counter_macro_idx += 1;
return t;
}
// __TIMESTAMP__ is expanded to a string describing the last
// modification time of the current file. E.g.
// "Fri Jul 24 01:32:50 2020"
#[instrument(ret)]
fn timestamp_macro(tmpl: &Token, _: &mut Preprocess) -> Token {
let r = fs::metadata(&tmpl.file.borrow().name);
match r {
Ok(m) => {
// Use mtime: __TIMESTAMP__ reports the last *modification* time.
let dt = DateTime::from_timestamp(m.mtime(), 0).unwrap();
let formatted = format!("{}", dt.format("%a %b %e %H:%M:%S %Y"));
return Token::new_str_token(&formatted, tmpl);
}
Err(_) => {
return Token::new_str_token("??? ??? ?? ??:??:?? ????", tmpl);
}
}
}
#[instrument(ret)]
fn base_file_macro(tmpl: &Token, pp: &mut Preprocess) -> Token {
return Token::new_str_token(&pp.base_file, tmpl);
}
}
// __DATE__ is expanded to the current date, e.g. "May 17 2020"
#[instrument(ret)]
fn format_date(dt: DateTime<Local>) -> String {
return format!("{}", dt.format("%b %e %Y"));
}
// __TIME__ is expanded to the current time, e.g. "13:34:03"
#[instrument(ret)]
fn format_time(dt: DateTime<Local>) -> String {
return format!("{}", dt.format("%H:%M:%S"));
}
impl Preprocess {
pub fn init_macros(&mut self) {
// Define predefined macros
self.define_macro("_LP64", "1");
self.define_macro("__C99_MACRO_WITH_VA_ARGS", "1");
self.define_macro("__ELF__", "1");
self.define_macro("__LP64__", "1");
self.define_macro("__SIZEOF_DOUBLE__", "8");
self.define_macro("__SIZEOF_FLOAT__", "4");
self.define_macro("__SIZEOF_INT__", "4");
self.define_macro("__SIZEOF_LONG_DOUBLE__", "8");
self.define_macro("__SIZEOF_LONG_LONG__", "8");
self.define_macro("__SIZEOF_LONG__", "8");
self.define_macro("__SIZEOF_POINTER__", "8");
self.define_macro("__SIZEOF_PTRDIFF_T__", "8");
self.define_macro("__SIZEOF_SHORT__", "2");
self.define_macro("__SIZEOF_SIZE_T__", "8");
self.define_macro("__SIZE_TYPE__", "unsigned long");
self.define_macro("__STDC_HOSTED__", "1");
self.define_macro("__STDC_NO_COMPLEX__", "1");
self.define_macro("__STDC_UTF_16__", "1");
self.define_macro("__STDC_UTF_32__", "1");
self.define_macro("__STDC_VERSION__", "201112L");
self.define_macro("__STDC__", "1");
self.define_macro("__USER_LABEL_PREFIX__", "");
self.define_macro("__alignof__", "_Alignof");
self.define_macro("__amd64", "1");
self.define_macro("__amd64__", "1");
self.define_macro("__chibicc__", "1");
self.define_macro("__const__", "const");
self.define_macro("__gnu_linux__", "1");
self.define_macro("__inline__", "inline");
self.define_macro("__linux", "1");
self.define_macro("__linux__", "1");
self.define_macro("__signed__", "signed");
self.define_macro("__typeof__", "typeof");
self.define_macro("__unix", "1");
self.define_macro("__unix__", "1");
self.define_macro("__volatile__", "volatile");
self.define_macro("__x86_64", "1");
self.define_macro("__x86_64__", "1");
self.define_macro("linux", "1");
self.define_macro("unix", "1");
self.add_builtin("__FILE__", Token::file_macro);
self.add_builtin("__LINE__", Token::line_macro);
self.add_builtin("__COUNTER__", Token::counter_macro);
self.add_builtin("__TIMESTAMP__", Token::timestamp_macro);
self.add_builtin("__BASE_FILE__", Token::base_file_macro);
let dt = Local::now();
self.define_macro("__DATE__", &format_date(dt));
self.define_macro("__TIME__", &format_time(dt));
}
}
#[derive(Debug, PartialEq)]
enum StringKind {
None,
UTF8,
UTF16,
UTF32,
Wide,
}
impl Token {
#[instrument(ret)]
fn get_string_kind(&self) -> StringKind {
if self.loc[0] == b'u' && self.loc[1] == b'8' {
return StringKind::UTF8;
}
match self.loc[0] {
b'"' => StringKind::None,
b'u' => StringKind::UTF16,
b'U' => StringKind::UTF32,
b'L' => StringKind::Wide,
_ => panic!("unknown string literal prefix: {}", self.loc[0] as char),
}
}
}
impl TokenList {
// Concatenate adjacent string literals into a single string literal
// as per the C spec.
#[instrument(ret)]
fn join_adjacent_string_literals(mut self, tokenize: &Tokenize) -> TokenList {
// First pass: If regular string literals are adjacent to
// wide string literals, regular string literals are converted to a wide
// type before concatenation. In this pass, we do the conversion.
let mut i = 0;
while i < self.len() {
if self[i].kind != TokenKind::Str
|| i + 1 == self.len()
|| self[i + 1].kind != TokenKind::Str
{
i += 1;
continue;
}
let mut kind = self[i].get_string_kind();
let mut basety = if let TokenFields::Str(base_ty, _) = &self[i].other {
base_ty.clone()
} else {
panic!();
};
for j in i + 1..self.len() {
if self[j].kind != TokenKind::Str {
break;
}
let k = self[j].get_string_kind();
if kind == StringKind::None {
kind = k;
basety = if let TokenFields::Str(base_ty, _) = &self[j].other {
base_ty.clone()
} else {
panic!();
};
} else if k != StringKind::None && kind != k {
self[j].error("unsupported non-standard concatenation of string literals");
}
}
if basety.size > 1 {
for j in i..self.len() {
if self[j].kind != TokenKind::Str {
break;
}
if let TokenFields::Str(base_ty, _) = &self[j].other {
if base_ty.size == 1 {
self[j] = tokenize.tokenize_string_literal(
&self[j],
&basety,
self[j].at_bol,
self[j].has_space,
);
}
} else {
panic!();
};
}
}
while i < self.len() && self[i].kind == TokenKind::Str {
i += 1;
}
}
// Second pass: concatenate adjacent string literals
let mut ret = TokenList {
tokens: Vec::new(),
idx: 0,
};
let mut i = 0;
while i < self.len() {
if self[i].kind != TokenKind::Str
|| i + 1 == self.len()
|| self[i + 1].kind != TokenKind::Str
{
ret.tokens.push(self[i].clone());
i += 1;
} else {
let mut j = i + 1;
while j < self.len() && self[j].kind == TokenKind::Str {
j += 1;
}
let mut len = 1;
let mut buf: Vec<u8> = Vec::new();
for k in i..j {
if let TokenFields::Str(ty, s) = &self[k].other {
buf.extend_from_slice(s.as_ref());
if let TypeKind::Array { len: l, .. } = ty.kind {
len += l - 1;
} else {
panic!();
}
} else {
panic!();
}
}
let mut t = self[i].clone();
if let TokenFields::Str(ty, _) = &self[i].other {
if let TypeKind::Array { base, .. } = &ty.kind {
t.other = TokenFields::Str(array_of(&base, len), buf.into_boxed_slice());
} else {
panic!();
}
} else {
panic!();
}
ret.tokens.push(t);
i = j;
}
}
return ret;
}
}
impl Preprocess {
#[instrument]
pub fn preprocess(&mut self, tokens: TokenList, tokenize: &mut Tokenize) -> TokenList {
let mut tl = preprocess2(tokens, self, tokenize);
if self.cond_incl.len() > 0 {
self.cond_incl
.last()
.unwrap()
.borrow()
.tok
.error("unterminated conditional directive");
}
Tokenize::convert_pp_tokens(&mut tl.tokens);
tl = tl.join_adjacent_string_literals(tokenize);
for t in tl.tokens.iter_mut() {
t.line_no += t.line_delta;
}
return tl;
}
}
//
// parse.c (and add_type from type.c)
//
// This file contains a recursive descent parser for C.
//
// Most functions in this file are named after the symbols they are
// supposed to read from an input token list. For example, stmt() is
// responsible for reading a statement from a token list. The function
// then constructs an AST node representing a statement.
//
// Each function conceptually returns two values, an AST node and the
// remaining part of the input tokens. The original C code returns the
// remaining tokens to the caller via a pointer argument (C doesn't
// support multiple return values); this port advances the TokenList's
// `idx` cursor instead.
//
// Unlike many recursive descent parsers, we don't have the notion of
// an "input token stream". Most parsing functions don't change the
// global state of the parser, so it is very easy to look ahead an
// arbitrary number of tokens in this parser.
// Scope for local variables, global variables, typedefs
// or enum constants
struct VarScope {
var: Option<Rc<RefCell<Obj>>>,
type_def: Option<Type>,
enum_ty: Option<Type>,
enum_val: Option<i32>,
}
// Represent a block scope.
struct Scope {
// C has two block scopes, one is for variables/typedefs and
// the other is for struct/union/enum tags.
vars: HashMap<String, Rc<RefCell<VarScope>>>,
tags: HashMap<String, Type>,
}
impl Scope {
fn new() -> Scope {
return Scope {
vars: HashMap::new(),
tags: HashMap::new(),
};
}
}
// Variable attributes such as typedef or extern
struct VarAttr {
is_typedef: bool,
is_static: bool,
is_extern: bool,
is_inline: bool,
is_tls: bool,
align: i32,
}
// This struct represents a variable initializer. Since initializers
// can be nested (e.g. `int[2][2] = {{1, 2}, {3, 4}}`), this struct
// is a tree data structure.
struct Initializer {
ty: Type,
tok: Option<Token>,
is_flexible: bool,
// If it's not an aggregate type and has an initializer,
// `expr` has an initialization expression.
expr: Option<Node>,
// If it's an initializer for an aggregate type (e.g. array or struct),
// children has initializers for its children.
children: Option<Vec<Initializer>>,
// Only one member can be initialized for a union.
// `mem` is used to clarify which member is initialized.
mem: Option<Member>,
}
// For local variable initializer.
struct InitDesg {
idx: i32,
member: Member,
var: Obj,
}
struct Parse {
// All local variable instances created during parsing are
// accumulated to this list.
locals: Rc<RefCell<LinkedList<Rc<RefCell<Obj>>>>>,
// Likewise, global variables are accumulated to this list.
globals: Rc<RefCell<LinkedList<Rc<RefCell<Obj>>>>>,
scope: LinkedList<Scope>,
// Points to the function object the parser is currently parsing.
current_fn: Rc<RefCell<Obj>>,
// List of all goto statements and labels in the current function.
gotos: LinkedList<Rc<RefCell<Node>>>,
labels: LinkedList<Rc<RefCell<Node>>>,
// Current "break" and "continue" jump targets.
brk_label: Option<String>,
cont_label: Option<String>,
// Points to a node representing a switch if we are parsing
// a switch statement. Otherwise, None.
current_switch: Option<Node>,
builtin_alloca: Rc<RefCell<Obj>>,
}
// Round up `n` to the nearest multiple of `align`. For instance,
// align_to(5, 8) returns 8 and align_to(11, 8) returns 16.
fn align_to(n: i32, align: i32) -> i32 {
return ((n + align - 1) / align) * align;
}
fn align_down(n: i32, align: i32) -> i32 {
return align_to(n - align + 1, align);
}
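// (Illustrative: align_down(11, 8) returns 8, and align_down(8, 8)
// returns 8.)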
impl Parse {
fn enter_scope(&mut self) {
let s = Scope::new();
self.scope.push_front(s);
}
fn leave_scope(&mut self) {
self.scope.pop_front();
}
// Find a variable by name.
fn find_var(&self, tok: &Token) -> Option<Rc<RefCell<VarScope>>> {
for sc in self.scope.iter() {
if let Some(sc2) = sc.vars.get(tok.loc.as_str()) {
return Some(sc2.clone());
}
}
return None;
}
fn find_tag(&self, tok: &Token) -> Option<&Type> {
for sc in self.scope.iter() {
if let Some(ty) = sc.tags.get(tok.loc.as_str()) {
return Some(ty);
}
}
return None;
}
}
impl Node {
fn new_binary(kind: NodeBinaryKind, lhs: Node, rhs: Node, tok: Token) -> Node {
return Node {
ty: None,
tok,
kind: NodeFields::Binary {
kind,
lhs: Rc::new(RefCell::new(lhs)),
rhs: Rc::new(RefCell::new(rhs)),
},
};
}
fn new_unary(kind: NodeUnaryKind, expr: Node, tok: Token) -> Node {
return Node {
ty: None,
tok,
kind: NodeFields::Unary {
kind,
expr: Rc::new(RefCell::new(expr)),
},
};
}
fn new_num(val: i64, tok: Token) -> Node {
return Node {
ty: Some(TY_INT),
tok,
kind: NodeFields::Int(val),
};
}
fn new_long(val: i64, tok: Token) -> Node {
return Node {
ty: Some(TY_LONG),
tok,
kind: NodeFields::Int(val),
};
}
fn new_ulong(val: i64, tok: Token) -> Node {
return Node {
// TODO: chibicc gives these nodes ty_ulong; no unsigned type is set here yet.
ty: None,
tok,
kind: NodeFields::Int(val),
};
}
fn new_var_node(var: Rc<RefCell<Obj>>, tok: Token) -> Node {
return Node {
ty: None,
tok,
kind: NodeFields::Var(var),
};
}
fn new_vla_ptr(var: Rc<RefCell<Obj>>, tok: Token) -> Node {
return Node {
ty: None,
tok,
kind: NodeFields::VlaPtr(var),
};
}
fn new_cast(expr: Rc<RefCell<Node>>, ty: Type) -> Node {
expr.borrow_mut().add_type();
let tok = expr.borrow().tok.clone();
return Node {
ty: Some(copy_type(&ty)),
tok,
kind: NodeFields::Cast(expr),
};
}
#[instrument(ret)]
fn add_type(&mut self) {
// For many binary operators, we implicitly promote operands so
// that both operands have the same type. Any integral type smaller than
// int is always promoted to int. If the type of one operand is larger
// than the other's (e.g. "long" vs. "int"), the smaller operand will
// be promoted to match with the other.
//
// This operation is called the "usual arithmetic conversion".
#[instrument(ret)]
fn usual_arith_conv(lhs: &Rc<RefCell<Node>>, rhs: &Rc<RefCell<Node>>) {
let ty = get_common_type(
lhs.borrow().ty.as_ref().unwrap(),
rhs.borrow().ty.as_ref().unwrap(),
);
let new_lhs = Node::new_cast(lhs.clone(), ty.clone());
let mut l = lhs.borrow_mut();
*l = new_lhs;
let new_rhs = Node::new_cast(rhs.clone(), ty);
let mut r = rhs.borrow_mut();
*r = new_rhs;
}
if self.ty.is_some() {
return;
}
match &self.kind {
NodeFields::NullExpr => {
// TODO:
}
NodeFields::Binary { kind, lhs, rhs } => {
match kind {
NodeBinaryKind::Add
| NodeBinaryKind::Sub
| NodeBinaryKind::Mul
| NodeBinaryKind::Div
| NodeBinaryKind::Mod
| NodeBinaryKind::BitAnd
| NodeBinaryKind::BitOr
| NodeBinaryKind::BitXor => {
usual_arith_conv(lhs, rhs);
self.ty = lhs.borrow().ty.clone();
}
NodeBinaryKind::Neg => {
let ty = get_common_type(&TY_INT, lhs.borrow().ty.as_ref().unwrap());
*lhs.borrow_mut() = Node::new_cast(lhs.clone(), ty.clone());
self.ty = Some(ty);
}
NodeBinaryKind::Shl | NodeBinaryKind::Shr => {
self.ty = lhs.borrow().ty.clone();
}
NodeBinaryKind::Eq
| NodeBinaryKind::Ne
| NodeBinaryKind::Lt
| NodeBinaryKind::Le => {
usual_arith_conv(lhs, rhs);
self.ty = Some(TY_INT);
}
NodeBinaryKind::Assign => {
if let TypeKind::Array { .. } = &lhs.borrow().ty.as_ref().unwrap().kind {
let tok = &lhs.borrow().tok;
tok.error("not an lvalue");
}
if let TypeKind::Struct { .. } = lhs.borrow().ty.as_ref().unwrap().kind {
} else {
*rhs.borrow_mut() =
Node::new_cast(rhs.clone(), lhs.borrow().ty.clone().unwrap());
}
self.ty = lhs.borrow().ty.clone();
}
NodeBinaryKind::Comma => {
self.ty = rhs.borrow().ty.clone();
}
}
}
NodeFields::Member { member } => {
self.ty = Some(member.ty.clone());
}
NodeFields::Unary { kind, expr } => {
expr.borrow_mut().add_type();
match kind {
NodeUnaryKind::Addr => {
let e = expr.borrow();
let ty = e.ty.as_ref().unwrap();
if let TypeKind::Array { base, .. } = &ty.kind {
self.ty = Some(pointer_to(base));
} else {
// TODO: what about VLA_ARRAY?
self.ty = Some(pointer_to(ty));
}
}
NodeUnaryKind::Deref => {
let e = expr.borrow();
let ty = e.ty.as_ref().unwrap();
match &ty.kind {
TypeKind::Ptr { base, .. }
| TypeKind::Array { base, .. }
| TypeKind::Vla { base, .. } => {
if matches!(base.kind, TypeKind::Void) {
self.tok.error("dereferencing a void pointer");
}
self.ty = Some(base.as_ref().clone());
}
TypeKind::Void
| TypeKind::Bool
| TypeKind::Char
| TypeKind::Short
| TypeKind::Int
| TypeKind::Long
| TypeKind::Float
| TypeKind::Double
| TypeKind::LDouble
| TypeKind::Enum
| TypeKind::Func { .. }
| TypeKind::Struct { .. }
| TypeKind::Union { .. } => todo!(),
}
}
NodeUnaryKind::BitNot => {
self.ty = expr.borrow().ty.clone();
}
NodeUnaryKind::Not | NodeUnaryKind::LogAnd | NodeUnaryKind::LogOr => {
self.ty = Some(TY_INT);
}
}
}
NodeFields::Return => {}
NodeFields::Cond {
kind,
cond,
then,
els,
init,
inc,
..
} => {
cond.borrow_mut().add_type();
then.borrow_mut().add_type();
els.borrow_mut().add_type();
init.borrow_mut().add_type();
inc.borrow_mut().add_type();
// TODO: ND_COND
match kind {
NodeCondKind::Cond => {
if matches!(then.borrow().ty.as_ref().unwrap().kind, TypeKind::Void)
|| matches!(els.borrow().ty.as_ref().unwrap().kind, TypeKind::Void)
{
self.ty = Some(TY_VOID);
} else {
usual_arith_conv(then, els);
self.ty = then.borrow().ty.clone();
}
}
NodeCondKind::If
| NodeCondKind::For
| NodeCondKind::Do
| NodeCondKind::Switch
| NodeCondKind::Case
| NodeCondKind::Goto
| NodeCondKind::GotoExpr
| NodeCondKind::Label => {}
NodeCondKind::LabelVal => {
self.ty = Some(pointer_to(&TY_VOID));
}
}
}
NodeFields::Block => {}
NodeFields::Label { .. } => {}
NodeFields::FnCall { func_ty, args, .. } => {
for n in args.iter() {
n.borrow_mut().add_type();
}
self.ty = Some(func_ty.clone());
}
NodeFields::Stmt { kind, body, .. } => {
for n in body.iter() {
n.borrow_mut().add_type();
}
match kind {
NodeStmtKind::ExprStmt => {}
NodeStmtKind::StmtExpr => {
match body.last() {
Some(stmt) => {
let s = stmt.borrow();
match &s.kind {
NodeFields::Stmt { kind, expr, .. } => match kind {
NodeStmtKind::ExprStmt => {
assert!(expr.is_some());
self.ty = expr.as_ref().unwrap().borrow().ty.clone();
return;
}
NodeStmtKind::StmtExpr => {}
},
_ => panic!("stmt is not a NodeFields::Stmt: {:?}", s.kind),
}
}
None => {}
}
self.tok
.error("statement expression returning void is not supported");
}
}
}
NodeFields::Var(var) | NodeFields::VlaPtr(var) => {
self.ty = Some(var.borrow().ty.clone());
}
NodeFields::Cast(_) => {
assert!(self.ty.is_some());
}
NodeFields::Int(_) => {
assert!(self.ty.is_some());
}
NodeFields::Float(_) => {}
NodeFields::MemZero => {}
NodeFields::Asm(_) => todo!(),
NodeFields::Cas { addr, old, new } => {
addr.borrow_mut().add_type();
old.borrow_mut().add_type();
new.borrow_mut().add_type();
self.ty = Some(TY_BOOL);
// Both operands must be pointers; error out when they are not.
if !matches!(
addr.borrow().ty.as_ref().unwrap().kind,
TypeKind::Ptr { .. }
) {
addr.borrow().tok.error("pointer expected");
}
if !matches!(old.borrow().ty.as_ref().unwrap().kind, TypeKind::Ptr { .. }) {
old.borrow().tok.error("pointer expected");
}
}
NodeFields::Exch { lhs, rhs } => match &lhs.borrow().ty.as_ref().unwrap().kind {
TypeKind::Ptr { base } => {
self.ty = Some(base.as_ref().clone());
}
_ => lhs.borrow().tok.error("pointer expected"),
},
}
// TODO: statement nodes (Return, Block, etc.) never set `ty`, so this
// assert may be too strict for them.
assert!(self.ty.is_some());
}
}
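The `usual_arith_conv` helper in `add_type` leans on `get_common_type`, which is defined elsewhere in the file. As a rough sketch of the integer half of the rule it implements (everything smaller than `int` promotes to `int`, and mixed widths promote to the wider type), using a toy type enum rather than the compiler's real `Type`:

```rust
// Toy stand-in for TypeKind: variants are declared narrowest-first, so the
// derived PartialOrd compares by width.
#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
enum Ty { Char, Short, Int, Long }

// Sketch of the "usual arithmetic conversion" for integer types only:
// promote both operands to at least int, then take the wider of the two.
fn common_type(a: Ty, b: Ty) -> Ty {
    let a = if a < Ty::Int { Ty::Int } else { a };
    let b = if b < Ty::Int { Ty::Int } else { b };
    if a < b { b } else { a }
}

fn main() {
    assert_eq!(common_type(Ty::Char, Ty::Short), Ty::Int); // both promote to int
    assert_eq!(common_type(Ty::Int, Ty::Long), Ty::Long);  // widen to long
    assert_eq!(common_type(Ty::Char, Ty::Long), Ty::Long);
    println!("ok");
}
```

The real `get_common_type` also has to handle pointers, floats, and unsigned types; this sketch only shows why `usual_arith_conv` can cast both operands to a single common type.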
impl Parse {
fn push_scope(&mut self, name: String) -> Rc<RefCell<VarScope>> {
let sc = Rc::new(RefCell::new(VarScope {
var: None,
type_def: None,
enum_ty: None,
enum_val: None,
}));
self.scope
.front_mut()
.unwrap()
.vars
.insert(name, sc.clone());
return sc;
}
}
impl Initializer {
fn new(ty: Type, is_flexible: bool) -> Initializer {
let mut init = Initializer {
ty,
tok: None,
is_flexible: false,
expr: None,
children: None,
mem: None,
};
match &init.ty.kind {
TypeKind::Void
| TypeKind::Bool
| TypeKind::Char
| TypeKind::Short
| TypeKind::Int
| TypeKind::Long
| TypeKind::Float
| TypeKind::Double
| TypeKind::LDouble
| TypeKind::Enum
| TypeKind::Ptr { .. }
| TypeKind::Func { .. } => {}
TypeKind::Array { base, len } => {
// TODO: when is ty.size < 0
if is_flexible && init.ty.size < 0 {
init.is_flexible = true;
return init;
}
let mut children = Vec::with_capacity(*len);
for _ in 0..*len {
children.push(Initializer::new(base.as_ref().clone(), false));
}
init.children = Some(children);
return init;
}
TypeKind::Vla { .. } => {} // Do nothing?
TypeKind::Struct {
members,
is_flexible: tyif,
..
}
| TypeKind::Union {
members,
is_flexible: tyif,
..
} => {
let mut children = Vec::with_capacity(members.len());
for (i, mem) in members.iter().enumerate() {
if is_flexible && *tyif && i + 1 == members.len() {
let child = Initializer {
ty: mem.ty.clone(),
tok: None,
is_flexible: true,
expr: None,
children: None,
mem: None,
};
children.push(child);
} else {
children.push(Initializer::new(mem.ty.clone(), false));
}
}
init.children = Some(children);
return init;
}
}
return init;
}
}
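The recursive shape of `Initializer::new` is easier to see with a toy version: an array type of length n produces an initializer with n children, built recursively (simplified stand-in types, not the real `Type`/`Initializer`):

```rust
// Toy model of the initializer tree: scalars are leaves, arrays recurse.
enum ToyType {
    Int,
    Array(Box<ToyType>, usize), // element type, length
}

struct ToyInit {
    children: Vec<ToyInit>,
}

fn new_init(ty: &ToyType) -> ToyInit {
    match ty {
        ToyType::Int => ToyInit { children: Vec::new() },
        // One child initializer per array element, built recursively,
        // mirroring the TypeKind::Array arm above.
        ToyType::Array(base, len) => ToyInit {
            children: (0..*len).map(|_| new_init(base)).collect(),
        },
    }
}

fn main() {
    // int x[2][3] -> 2 children, each with 3 children.
    let ty = ToyType::Array(Box::new(ToyType::Array(Box::new(ToyType::Int), 3)), 2);
    let init = new_init(&ty);
    assert_eq!(init.children.len(), 2);
    assert_eq!(init.children[0].children.len(), 3);
    println!("ok");
}
```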
impl Parse {
fn new_var(
&mut self,
name: String,
ty: Type,
next: Rc<RefCell<LinkedList<Rc<RefCell<Obj>>>>>,
) -> Rc<RefCell<Obj>> {
let align = ty.align;
let var = Obj {
next,
name: name.clone(),
ty,
tok: None,
is_local: false,
align,
other: ObjFields::None,
};
let sc = self.push_scope(name);
let varrc = Rc::new(RefCell::new(var));
sc.borrow_mut().var = Some(varrc.clone());
return varrc;
}
fn new_lvar(&mut self, name: String, ty: Type) -> Rc<RefCell<Obj>> {
let var = self.new_var(name, ty, self.locals.clone());
var.borrow_mut().is_local = true;
let mut l = self.locals.borrow_mut();
l.push_front(var.clone());
return var;
}
fn new_gvar(&mut self, name: String, ty: Type) -> Rc<RefCell<Obj>> {
let var = self.new_var(name, ty, self.globals.clone());
var.borrow_mut().other = ObjFields::GVar {
is_function: false,
is_definition: true,
is_static: true,
is_tentative: false,
is_tls: false,
init_data: Vec::new(),
rel: None,
};
let mut g = self.globals.borrow_mut();
g.push_front(var.clone());
return var;
}
}
// TODO: new_unique_name.