Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type Independent String Functions #891

Closed
wants to merge 45 commits into from
Closed
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
40f016f
String Utilities
BraedonWooding Apr 4, 2018
54b5c00
string utilities
BraedonWooding Apr 4, 2018
ae1b7e4
remove weird files
BraedonWooding Apr 4, 2018
3f20233
Changed higher to upper
BraedonWooding Apr 4, 2018
39d92e8
Added unicode encoding
BraedonWooding Apr 4, 2018
ec3194a
Fixes an overflow bug (wow Zig is actually good at protecting against…
BraedonWooding Apr 5, 2018
714b89c
Basics of locale
BraedonWooding Apr 5, 2018
4793904
Fixed up array
BraedonWooding Apr 5, 2018
23d424c
allowing slice size to be larger then 4
BraedonWooding Apr 5, 2018
1f47341
Fixed bugs in unicode mostly type issues, and then generalised the it…
BraedonWooding Apr 5, 2018
48028ce
Finished ascii example
BraedonWooding Apr 5, 2018
ebca31a
Updated init to be check the characters
BraedonWooding Apr 5, 2018
66a0559
Removed change in git ignore
BraedonWooding Apr 5, 2018
932eb6f
Mem.Split doesn't care about your type now
BraedonWooding Apr 6, 2018
1648192
Locale can figure out if a codepoint is a number, letter, or whitespa…
BraedonWooding Apr 7, 2018
1a75cdf
Unicode encoding now throws
BraedonWooding Apr 7, 2018
15ccce7
Added Utf8 Locale
BraedonWooding Apr 8, 2018
6868c71
Implemented join
BraedonWooding Apr 9, 2018
a307291
Remove Locale, fix split bug presuming length 1
BraedonWooding Apr 12, 2018
9e4206c
Fixed bugs in join
BraedonWooding Apr 12, 2018
2ace38b
Merge branch 'master' into StringUtils
BraedonWooding Apr 12, 2018
f570f18
Fixed spawn process error
BraedonWooding Apr 12, 2018
dfdfde2
Merge branch 'StringUtils' of github.com:BraedonWooding/zig into Stri…
BraedonWooding Apr 12, 2018
0999ea8
Fixed more compile time errors
BraedonWooding Apr 12, 2018
fb24ebc
Should be the last of fixes for path
BraedonWooding Apr 12, 2018
8b9db75
Generic trim
BraedonWooding Apr 12, 2018
2f027b7
Merge branch 'master' of github.com:zig-lang/zig into StringUtils
BraedonWooding Apr 13, 2018
3639e7a
Cleaned up and added docs
BraedonWooding Apr 13, 2018
dc3cfda
Name changes stuffed compile
BraedonWooding Apr 13, 2018
dd349fb
Tests both windows and posix joins
BraedonWooding Apr 13, 2018
5b8012b
Fixed up term color error
BraedonWooding Apr 13, 2018
f3bf4f6
This 'should' fix the issues with appveyor
BraedonWooding Apr 13, 2018
b0ea58a
removed const qualifier
BraedonWooding Apr 13, 2018
6fcbf6e
Merge branch 'master' into StringUtils
BraedonWooding Apr 14, 2018
2d2477a
Fixup merge
BraedonWooding Apr 14, 2018
b2cbcb5
Made requested changes
BraedonWooding Apr 14, 2018
ded5e5d
Fixed commit
BraedonWooding Apr 14, 2018
8453786
Removed 'std.string' from main.zig
BraedonWooding Apr 14, 2018
c6ff5e4
Fixed up wrong call
BraedonWooding Apr 14, 2018
a2269f9
Restructured it nicer, and removed the workaround with a proper solut…
BraedonWooding Apr 14, 2018
6e61be7
Things use utf8 for splitting rather than ascii call
BraedonWooding Apr 14, 2018
6294dd4
Added error to possible errors in spawnError
BraedonWooding Apr 14, 2018
4dc263e
typo
BraedonWooding Apr 14, 2018
17dc853
Fixed up a bug I've been meaning to fix for a while
BraedonWooding Apr 14, 2018
b8d1995
Made requested changes
BraedonWooding Apr 20, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
zig-cache/
build/
build-*/
.vscode/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #888

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough I'll remove it :), I shouldn't be calling 'git add .' anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to use git add .. Everything to ignore falls into 2 categories: system-wide ignore, or project-specific ignore. If both are correct then git add . works just fine.

1 change: 1 addition & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -441,6 +441,7 @@ set(ZIG_STD_FILES
"index.zig"
"io.zig"
"linked_list.zig"
"string_utils.zig"
"macho.zig"
"math/acos.zig"
"math/acosh.zig"
Expand Down
5 changes: 3 additions & 2 deletions doc/docgen.zig
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ const os = std.os;
const warn = std.debug.warn;
const mem = std.mem;
const assert = std.debug.assert;
const string = std.stringUtils;

const max_doc_file_size = 10 * 1024 * 1024;

Expand Down Expand Up @@ -309,7 +310,7 @@ const Node = union(enum) {
const Toc = struct {
nodes: []Node,
toc: []u8,
urls: std.HashMap([]const u8, Token, mem.hash_slice_u8, mem.eql_slice_u8),
urls: std.HashMap([]const u8, Token, string.hash_str, string.str_eql),
};

const Action = enum {
Expand All @@ -318,7 +319,7 @@ const Action = enum {
};

fn genToc(allocator: &mem.Allocator, tokenizer: &Tokenizer) !Toc {
var urls = std.HashMap([]const u8, Token, mem.hash_slice_u8, mem.eql_slice_u8).init(allocator);
var urls = std.HashMap([]const u8, Token, string.hash_str, string.str_eql).init(allocator);
errdefer urls.deinit();

var header_stack_size: usize = 0;
Expand Down
3 changes: 2 additions & 1 deletion std/buf_map.zig
Original file line number Diff line number Diff line change
@@ -1,13 +1,14 @@
const HashMap = @import("hash_map.zig").HashMap;
const mem = @import("mem.zig");
const Allocator = mem.Allocator;
const string = @import("string_utils.zig");

/// BufMap copies keys and values before they go into the map, and
/// frees them when they get removed.
pub const BufMap = struct {
hash_map: BufMapHashMap,

const BufMapHashMap = HashMap([]const u8, []const u8, mem.hash_slice_u8, mem.eql_slice_u8);
const BufMapHashMap = HashMap([]const u8, []const u8, string.hash_str, string.str_eql);

pub fn init(allocator: &Allocator) BufMap {
var self = BufMap {
Expand Down
3 changes: 2 additions & 1 deletion std/buf_set.zig
Original file line number Diff line number Diff line change
@@ -1,11 +1,12 @@
const HashMap = @import("hash_map.zig").HashMap;
const mem = @import("mem.zig");
const Allocator = mem.Allocator;
const string = @import("string_utils.zig");

pub const BufSet = struct {
hash_map: BufSetHashMap,

const BufSetHashMap = HashMap([]const u8, void, mem.hash_slice_u8, mem.eql_slice_u8);
const BufSetHashMap = HashMap([]const u8, void, string.hash_str, string.str_eql);

pub fn init(a: &Allocator) BufSet {
var self = BufSet {
Expand Down
5 changes: 3 additions & 2 deletions std/build.zig
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ const Term = os.ChildProcess.Term;
const BufSet = std.BufSet;
const BufMap = std.BufMap;
const fmt_lib = std.fmt;
const string = std.stringUtils;

pub const Builder = struct {
uninstall_tls: TopLevelStep,
Expand Down Expand Up @@ -48,8 +49,8 @@ pub const Builder = struct {
cache_root: []const u8,
release_mode: ?builtin.Mode,

const UserInputOptionsMap = HashMap([]const u8, UserInputOption, mem.hash_slice_u8, mem.eql_slice_u8);
const AvailableOptionsMap = HashMap([]const u8, AvailableOption, mem.hash_slice_u8, mem.eql_slice_u8);
const UserInputOptionsMap = HashMap([]const u8, UserInputOption, string.hash_str, string.str_eql);
const AvailableOptionsMap = HashMap([]const u8, AvailableOption, string.hash_str, string.str_eql);

const AvailableOption = struct {
name: []const u8,
Expand Down
2 changes: 2 additions & 0 deletions std/index.zig
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ pub const rand = @import("rand/index.zig");
pub const sort = @import("sort.zig");
pub const unicode = @import("unicode.zig");
pub const zig = @import("zig/index.zig");
pub const stringUtils = @import("string_utils.zig");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision to put these things in something called "memory" rather than "string" is intentional.

A "string" would be something like []u32, where each item is a unicode point. Zig std lib does not have a string data type yet - and it looks like this PR does not introduce one. Although we do have some utf8 decoding and encoding code in std/unicode.zig.

String utilities would be a welcome addition to the zig standard library, but they should operate on unicode points, not encoded bytes. Utilities that operate on encoded bytes should be clear about that, using parameter names such as "bytes" and explaining the difference between encoded strings and actual strings in the docs.


test "std" {
// run tests from these
Expand Down Expand Up @@ -62,4 +63,5 @@ test "std" {
_ = @import("sort.zig");
_ = @import("unicode.zig");
_ = @import("zig/index.zig");
_ = @import("string_utils.zig");
}
14 changes: 0 additions & 14 deletions std/mem.zig
Original file line number Diff line number Diff line change
Expand Up @@ -316,20 +316,6 @@ pub fn writeInt(buf: []u8, value: var, endian: builtin.Endian) void {
assert(bits == 0);
}


pub fn hash_slice_u8(k: []const u8) u32 {
// FNV 32-bit hash
var h: u32 = 2166136261;
for (k) |b| {
h = (h ^ b) *% 16777619;
}
return h;
}

pub fn eql_slice_u8(a: []const u8, b: []const u8) bool {
return eql(u8, a, b);
}

/// Returns an iterator that iterates over the slices of `buffer` that are not
/// any of the bytes in `split_bytes`.
/// split(" abc def ghi ", " ")
Expand Down
230 changes: 230 additions & 0 deletions std/string_utils.zig
Original file line number Diff line number Diff line change
@@ -0,0 +1,230 @@
const std = @import("index.zig");
const unicode = std.unicode;
const mem = std.mem;
const math = std.math;
const Set = std.BufSet;
const assert = std.debug.assert;
const allocator = std.debug.debug_allocator;

// Handles a series of string utilities that are focused around handling and manipulating strings

// Hash code for a string
pub fn hash_str(k: []const u8) u32 {
// FNV 32-bit hash
var h: u32 = 2166136261;
for (k) |b| {
h = (h ^ b) *% 16777619;
}
return h;
}

// Just directs you to the standard string handler
pub fn hash_unicode(k: unicode.Utf8View) u32 {
return hash_str(k.bytes);
}

pub fn find_str(a: []const u8, target: []const u8, start: usize, end: usize, highest: bool) ?usize {
var array = a[start..end];
var i : usize = 0;
var index : ?usize = null;

while (i < array.len) {
// If there is no possible way we could fit the string early return
if (array.len - i < target.len) return index;

if (array[i] == target[0]) {
var equal = true;
var j : usize = 1;

while (j < target.len) {
if (array[i + j] != target[j]) {
equal = false;

// Reduce amount of comparisons
i += j - 1;
break;
}
j += 1;
}

if (equal) {
index = i;
if (!highest) {
return index;
} else {
i += j - 1;
}
}
}
i += 1;
}

return index;
}

pub const Side = enum {
LEFT,
RIGHT,
BOTH,
};

pub fn strip_whitespace(a: []const u8, sides: Side)[]const u8 {
// Just a placeholder replace later with proper locale whitespace
return strip(a, " \t\n\r", sides);
}

fn impl_strip_side(a: []const u8, characters: []const u8, start: usize, change: usize, decrement: bool) usize {
var moved = true;
var index = start;

while (moved) {
moved = false;
for (characters) |char| {
if (char == a[index]) {
moved = true;
if (decrement) index -= change else index += change;
break;
}
}
}
return index;
}

// If max is 0 then it'll do forever
pub fn split(a: []const u8, sep: u8, out: &[][]const u8)void {
var actualCount: usize = 0;
var previousIndex: usize = 0;

for (a) |char, i| {
if (char == sep) {
if (i - previousIndex == 0) {
(*out)[actualCount] = "";
} else {
(*out)[actualCount] = a[previousIndex..i];
}

previousIndex = i + 1;
actualCount += 1;

if (actualCount == out.len) break;
}
}

(*out)[actualCount] = a[previousIndex..];
actualCount += 1;
*out = (*out)[0..actualCount];
}

// Note: characters is an array of u8 not a string!
// So passing in "abc" doesn't strip abc it strips a, b, and c
pub fn strip(a: []const u8, characters: []const u8, sides: Side)[]const u8 {
var start: usize = 0;
var end: usize = a.len - 1;
if (sides == Side.LEFT or sides == Side.BOTH) {
// Trim left
start = impl_strip_side(a, characters, start, 1, false);
}

if (sides == Side.RIGHT or sides == Side.BOTH) {
// Trim right
end = impl_strip_side(a, characters, end, 1, true);
}

// +1 to convert to 1-index
return a[start..end + 1];
}

pub fn starts_with(a: []const u8, target: []const u8) bool {
// Because we are 0-indexing it not 1-indexing it
if (a.len < target.len) return false;
var i : usize = 0;

while (i < target.len) {
if (a[i] != target[i]) return false;
i += 1;
}
return true;
}

pub fn ends_with(a: []const u8, target: []const u8) bool {
if (a.len < target.len) return false;
var diff : usize = a.len - target.len;
var i : usize = a.len - 1;

while (i >= target.len) {
if (a[i] != target[i - diff]) return false;
i -= 1;
}
return true;
}

pub fn str_eql(a: []const u8, b: []const u8) bool {
return mem.eql(u8, a, b);
}

pub fn is_num(byte: u8) bool {
return byte >= '0' and byte <= '9';
}

pub fn to_upper(byte: u8) u8 {
return if(is_ascii_lower(byte)) byte - 32 else byte;
}

pub fn to_lower(byte: u8) u8 {
return if(is_ascii_upper(byte)) byte + 32 else byte;
}

pub fn is_ascii_letter(byte: u8) bool {
return is_ascii_lower(byte) or is_ascii_upper(byte);
}

pub fn is_ascii_lower(byte: u8) bool {
return byte >= 'a' and byte <= 'z';
}

pub fn is_ascii_upper(byte: u8) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly confident that we should not have asciiUpper and asciiLower in the standard library for these reasons:

  • it's too easy to misuse, when actual strings should be used instead
  • if it really is the odd case where you need ascii upper/ascii lower, writing out the function body is easy enough. it should be enough tedium to get people to consider using the actual string functions.

Do you have an actual use case for ascii upper/lower right now?

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the randomizer I use a few ascii functions(They take &const u8 because I use them as fn ptrs to generic functions), because I know the strings I get from the nds headers are ascii. I don't mind not having it in std though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's fair. Maybe it's not so bad to have them. We should definitely put a bunch of noticeable warning signs near them and avoid the use of the word "string".

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could live in ascii.zig. Then you sign up for ascii problems by importing that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with that.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that "ascii" is different from []u8. ascii requires the top bit of every byte to be 0, so if you're doing ascii string processing, you might consider validating that top bit.

those ascii functions you linked are correct, because they check for specific ranges of values, but something like ascii_to_upper() should perhaps assert the top bit is 0.

Copy link
Contributor Author

@BraedonWooding BraedonWooding Apr 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could also just check if it was within 127? That avoids the bit masking operations (which are more expensive and remove clarity)? But yes it probably does have to be checked, though I would rather verify data through the locale viewer (which would check just once) rather than an indeterminate amount of times :).

return byte >= 'A' and byte <= 'Z';
}

test "String_Utils" {
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend splitting this into some smaller tests like:

test "string_utils.is_ascii_letter" {
    assert(is_ascii_letter('C'));
    assert(is_ascii_letter('e'));
    assert(!is_ascii_letter('2'));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes of course, I just wanted to get a basic testing suite up first :).

assert(is_ascii_letter('C'));
assert(is_ascii_letter('e'));
assert(!is_ascii_letter('2'));
assert(!is_ascii_upper('a'));
assert(is_ascii_upper('B'));
assert(!is_ascii_upper('5'));
assert(is_ascii_lower('a'));
assert(!is_ascii_lower('K'));
assert(!is_ascii_lower('-'));

assert(is_num('0'));
assert(!is_num('a'));

assert(str_eql("HOPE", "HOPE"));
assert(!str_eql("Piece", "Peace"));

assert(ends_with("Hopie", "pie"));
assert(!ends_with("Cat", "ta"));

assert(starts_with("bat", "ba"));
assert(!starts_with("late", "ma"));

assert(?? find_str("boo", "o", 0, 3, true) == 2);
assert(?? find_str("nookies", "ook", 0, 7, false) == 1);
assert(find_str("answer to the universe", "42", 0, 22, false) == null);

assert(str_eql(strip_whitespace(" a ", Side.BOTH), "a"));
assert(str_eql(strip_whitespace(" a ", Side.LEFT), "a "));
assert(str_eql(strip_whitespace(" a ", Side.RIGHT), " a"));

assert(str_eql(strip("mississippi", "ipz", Side.BOTH), "mississ"));

var splits : [3][]const u8 = undefined;
split("Cat,Bat,Mat", ',', &splits[0..]);
var expected = [][3]u8 { "Cat", "Bat", "Mat" };

assert(expected.len == splits.len);
for (expected) |str, i| {
assert(str_eql(splits[i], str));
}
}