{"id":3691455,"name":"github.com/lytics/cloudforest","ecosystem":"go","description":"Package CloudForest implements ensembles of decision trees for machine\nlearning in pure Go (golang to search engines). It allows for a number of related algorithms\nfor classification, regression, feature selection and structure analysis on heterogeneous\nnumerical/categorical data with missing values. These include:\n\nBreiman and Cutler's Random Forest for Classification and Regression\n\nAdaptive Boosting (AdaBoost) Classification\n\nGradient Boosting Tree Regression\n\nEntropy and Cost driven classification\n\nL1 regression\n\nFeature selection with artificial contrasts\n\nProximity and model structure analysis\n\nRoughly balanced bagging for unbalanced classification\n\nThe API hasn't stabilized yet and may change rapidly. Tests and benchmarks have been performed\nonly on embargoed data sets and cannot yet be released.\n\nLibrary Documentation is in code and can be viewed with godoc or live at:\nhttp://godoc.org/github.com/ryanbressler/CloudForest\n\nDocumentation of command line utilities and file formats can be found in README.md, which can be\nviewed formatted on GitHub:\nhttp://github.com/ryanbressler/CloudForest\n\nPull requests and bug reports are welcome.\n\nCloudForest was created by Ryan Bressler and is being developed in the Shmulevich Lab at\nthe Institute for Systems Biology for use on genomic/biomedical data with partial support\nfrom The Cancer Genome Atlas and the Inova Translational Medicine Institute.\n\nCloudForest is intended to provide fast, comprehensible building blocks that can\nbe used to implement ensembles of decision trees. 
CloudForest is written in Go to\nallow a data scientist to develop and scale new models and analysis quickly instead\nof having to modify complex legacy code.\n\nData structures and file formats are chosen with use in multithreaded and cluster\nenvironments in mind.\n\nGo's support for function types is used to provide an interface to run code as data\nis percolated through a tree. This method is flexible enough that it can extend the tree being\nanalyzed. Growing a decision tree using Breiman and Cutler's method can be done in an anonymous\nfunction/closure passed to a tree's root node's Recurse method:\n\nThis allows a researcher to include whatever additional analysis they need (importance scores,\nproximity etc) in tree growth. The same Recurse method can also be used to analyze existing forests\nto tabulate scores or extract structure. Utilities like leafcount and errorrate use this\nmethod to tabulate data about the tree in collection objects.\n\nDecision trees are grown with the goal of reducing \"Impurity\", which is usually defined as Gini\nImpurity for categorical targets or mean squared error for numerical targets. CloudForest grows\ntrees against the Target interface, which allows for alternative definitions of impurity. CloudForest\nincludes several alternative targets:\n\nAdditional targets can be stacked on top of these targets to add boosting functionality:\n\nRepeatedly splitting the data and searching for the best split at each node of a decision tree\nare the most computationally intensive parts of decision tree learning and CloudForest includes\noptimized code to perform these tasks.\n\nGo's slices are used extensively in CloudForest to make it simple to interact with optimized code.\nMany previous implementations of Random Forest have avoided reallocation by reordering data in\nplace and keeping track of start and end indexes. In Go, slices pointing at the same underlying\narrays make this sort of optimization transparent. 
For example, a function like:\n\ncan return left and right slices that point to the same underlying array as the original\nslice of cases, but these slices should not have their values changed.\n\nFunctions used while searching for the best split also accept pointers to reusable slices and\nstructs to maximize speed by keeping memory allocations to a minimum. BestSplitAllocs contains\npointers to these items and its use can be seen in functions like:\n\nFor categorical predictors, BestSplit will also attempt to intelligently choose between 4\ndifferent implementations depending on user input and the number of categories.\nThese include exhaustive, random, and iterative searches for the best combination of categories\nimplemented with bitwise operations against int and big.Int. See BestCatSplit, BestCatSplitIter,\nBestCatSplitBig and BestCatSplitIterBig.\n\nAll numerical predictors are handled by BestNumSplit, which\nrelies on Go's sort package.\n\nTraining a Random Forest is an inherently parallel process and CloudForest is designed\nto allow parallel implementations that can tackle large problems while keeping memory\nusage low by writing and using data structures directly to/from disk.\n\nTrees can be grown in separate goroutines. The growforest utility provides an example\nof this that uses goroutines and channels to grow trees in parallel and write trees\nto disk as they are finished by the \"worker\" goroutines. A few summary statistics,\nlike mean impurity decrease per feature (importance), can be calculated using thread-safe\ndata structures like RunningMean.\n\nTrees can also be grown on separate machines. 
The .sf stochastic forest format\nallows several small forests to be combined by concatenation and the ForestReader\nand ForestWriter structs allow these forests to be accessed tree by tree (or even node\nby node) from disk.\n\nFor data sets that are too big to fit in memory on a single machine, Tree.Grow and\nFeatureMatrix.BestSplitter can be reimplemented to load candidate features from disk,\na distributed database, etc.\n\nBy default CloudForest uses a fast heuristic for missing values. When proposing a split on a feature\nwith missing data, the missing cases are removed and the impurity value is corrected to use three-way impurity,\nwhich reduces the bias towards features with lots of missing data:\n\nMissing values in the target variable are left out of impurity calculations.\n\nThis provides generally good results at a fraction of the computational cost of imputing data.\n\nOptionally, feature.ImputeMissing or FeatureMatrix.ImputeMissing can be called before forest growth\nto impute missing values to the feature mean/mode, which Breiman [1] suggests as a fast method for\nimputing values.\n\nThis forest could also be analyzed for proximity (using leafcount or tree.GetLeaves) to do the\nmore accurate proximity-weighted imputation Breiman describes.\n\nExperimental support is provided for 3-way splitting, which splits missing cases onto a third branch [2].\nThis has so far yielded mixed results in testing.\n\nAt some point in the future support may be added for local imputing of missing values during tree growth\nas described in [3].\n\n[1] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1\n\n[2] https://code.google.com/p/rf-ace/\n\n[3] http://projecteuclid.org/DPubS?verb=Display\u0026version=1.0\u0026service=UI\u0026handle=euclid.aoas/1223908043\u0026page=record\n\nIn CloudForest data is stored using the FeatureMatrix struct, which contains Features.\n\nThe Feature struct implements storage and methods for both categorical and numerical data 
and\ncalculations of impurity etc and the search for the best split.\n\nThe Target interface abstracts the methods of Feature that are needed for a feature to be predictable.\nThis allows for the implementation of alternative types of regression and classification.\n\nTrees are built from Nodes and Splitters and stored within a Forest. Tree has a Grow\nmethod that implements Breiman and Cutler's method (see extract above) for growing a tree. A GrowForest\nmethod is also provided that implements the rest of the method including sampling cases,\nbut it may be faster to grow the forest to disk as in the growforest utility.\n\nPrediction and voting are done using Tree.Vote and CatBallotBox and NumBallotBox, which implement the\nVoteTallyer interface.","homepage":"https://github.com/lytics/cloudforest","licenses":"BSD-3-Clause","normalized_licenses":["BSD-3-Clause"],"repository_url":"https://github.com/lytics/cloudforest","keywords_array":[],"namespace":"github.com/lytics","versions_count":1,"first_release_published_at":"2020-11-16T17:40:08.000Z","latest_release_published_at":"2020-11-16T17:40:08.000Z","latest_release_number":"v0.0.0-20201116174008-381792ef996b","last_synced_at":"2026-03-27T06:08:43.656Z","created_at":"2022-04-11T11:30:54.829Z","updated_at":"2026-03-27T06:08:43.656Z","registry_url":"https://pkg.go.dev/github.com/lytics/cloudforest","install_command":"go get github.com/lytics/cloudforest","documentation_url":"https://pkg.go.dev/github.com/lytics/cloudforest#section-documentation","metadata":{},"repo_metadata":{"uuid":"42741998","full_name":"lytics/CloudForest","owner":"lytics","description":"Ensembles of decision trees in 
go/golang.","archived":false,"fork":true,"pushed_at":"2020-11-16T17:40:10.000Z","size":1902,"stargazers_count":15,"open_issues_count":2,"forks_count":3,"subscribers_count":16,"default_branch":"master","last_synced_at":"2023-02-23T14:02:38.891Z","etag":null,"topics":["math","random-forest","trees"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"ryanbressler/CloudForest","license":"other","status":null,"scm":"git","pull_requests_enabled":true,"logo_url":null,"metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-09-18T19:06:48.000Z","updated_at":"2021-06-04T12:28:45.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lytics/CloudForest","commit_stats":null,"repository_url":"http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lytics%2FCloudForest","tags_url":"http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lytics%2FCloudForest/tags","manifests_url":"http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lytics%2FCloudForest/manifests","owner_url":"http://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lytics","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":108921946,"host_url":"http://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"http://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"http://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names"}},"repo_metadata_updated_at":"2023-03-21T18:57:32.969Z","dependent_packages_count":0,"downloads":null,"downloads_period":null,"dependent_repos_count":0,"rankings":{"downloads":null,"dependent_repos_count":9.345852080216646,"dependent_packages_count":6.999148183520997,"stargazers_count":null,"forks_count":n
ull,"average":8.172500131868823},"purl":"pkg:golang/github.com/lytics/cloudforest","advisories":[],"docker_usage_url":"https://docker.ecosyste.ms/usage/go/github.com/lytics/cloudforest","docker_dependents_count":null,"docker_downloads_count":null,"usage_url":"https://repos.ecosyste.ms/usage/go/github.com/lytics/cloudforest","dependent_repositories_url":"https://repos.ecosyste.ms/api/v1/usage/go/github.com/lytics/cloudforest/dependencies","status":null,"funding_links":[],"critical":null,"issue_metadata":{"last_synced_at":"2023-08-11T20:43:32.501Z","issues_count":3,"pull_requests_count":28,"avg_time_to_close_issue":373866.5,"avg_time_to_close_pull_request":122076.03703703704,"issues_closed_count":2,"pull_requests_closed_count":27,"pull_request_authors_count":7,"issue_authors_count":2,"avg_comments_per_issue":4.333333333333333,"avg_comments_per_pull_request":1.0714285714285714,"merged_pull_requests_count":24,"bot_issues_count":0,"bot_pull_requests_count":0,"past_year_issues_count":0,"past_year_pull_requests_count":0,"past_year_avg_time_to_close_issue":null,"past_year_avg_time_to_close_pull_request":null,"past_year_issues_closed_count":0,"past_year_pull_requests_closed_count":0,"past_year_pull_request_authors_count":0,"past_year_issue_authors_count":0,"past_year_avg_comments_per_issue":null,"past_year_avg_comments_per_pull_request":null,"past_year_bot_issues_count":0,"past_year_bot_pull_requests_count":0,"past_year_merged_pull_requests_count":0},"versions_url":"https://packages.ecosyste.ms/api/v1/registries/proxy.golang.org/packages/github.com%2Flytics%2Fcloudforest/versions","version_numbers_url":"https://packages.ecosyste.ms/api/v1/registries/proxy.golang.org/packages/github.com%2Flytics%2Fcloudforest/version_numbers","dependent_packages_url":"https://packages.ecosyste.ms/api/v1/registries/proxy.golang.org/packages/github.com%2Flytics%2Fcloudforest/dependent_packages","related_packages_url":"https://packages.ecosyste.ms/api/v1/registries/proxy.golang.org/packages/gith
ub.com%2Flytics%2Fcloudforest/related_packages","codemeta_url":"https://packages.ecosyste.ms/api/v1/registries/proxy.golang.org/packages/github.com%2Flytics%2Fcloudforest/codemeta","maintainers":[]}