Data science
One completed talk and three topic suggestions are tagged with data science.
Related Tags
- statistics
- Tech Talks
- tutorial
- parallel computing
- distributed system
- first year friendly
- efficiency
- algorithm
Completed Talks
Metric embeddings and dimensionality reduction
Delivered by Frieda Rong on Friday March 31, 2017
In this talk, we consider embeddings that preserve the pairwise distances of a set of points. It is often useful to find mappings from a high-dimensional space to a lower-dimensional one that preserve the geometry of the points. One source of applications is streaming large amounts of data, for which storage is costly or impractical. The study of such embeddings has also inspired developments in the design of approximation algorithms and in compressed sensing.
At the crux of the talk is the remarkable Johnson-Lindenstrauss lemma. This fundamental result shows that for Euclidean spaces, it is possible to achieve significant dimensionality reduction of a set of points while approximately preserving the pairwise distances. An elementary proof will be given, along with subsequent speed improvements with sparse projections and an interesting use of the Fourier transform. We will also discuss applications of the lemma to the fields mentioned above.
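As a rough illustration of the kind of dimensionality reduction the lemma describes, the sketch below projects a few random points through a dense Gaussian matrix and compares pairwise distances before and after. It is not taken from the talk: the function names, the dimensions (500 down to 200), and the seeds are all assumptions chosen for the example, and the dense Gaussian projection stands in for the sparse and Fourier-based variants the talk mentions.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    # Entries are N(0, 1/k), so squared lengths are preserved in expectation.
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]


def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distortion_range(points, projected):
    """Smallest and largest ratio of projected to original pairwise distance."""
    ratios = [
        distance(projected[i], projected[j]) / distance(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return min(ratios), max(ratios)


# Hypothetical data: 8 random points in 500 dimensions, reduced to 200.
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(8)]
projected = random_projection(points, k=200)
lo, hi = distortion_range(points, projected)
print(f"distance ratios after projection: min={lo:.3f}, max={hi:.3f}")
```

Both ratios should come out close to 1, and increasing `k` tightens them further. The target dimension needed for a given distortion depends on the number of points rather than on the original dimension.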
Talk Suggestions
Complex Event Processing Systems
The ever-increasing amount of information that needs to be processed has led to the development of Complex Event Processing systems such as Apache Storm and Twitter Heron. These systems distribute a workload over many machines in a cluster, and offer both efficiency and fault tolerance.
Possible reference materials for this topic include:
Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal, S., ... & Taneja, S. (2015, May). Twitter Heron: Stream processing at scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 239-250). ACM. doi:10.1145/2723372.2742788
Quick links: Google search, arXiv.org search, propose to present a talk
Tags: Tech Talks, computer science, data science, distributed system, first year friendly, parallel computing, statistics
Dealing with Missing Data
Data are rarely perfect. Robust data science tools must have ways to deal with missing data. However, this is not always easy. A balance must be struck between performance and convenience.
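To make the trade-off concrete, here is a minimal sketch of two common strategies: dropping incomplete rows and filling gaps with column means. It is only an illustration, not a recommendation from the suggestion itself. The helper names, the use of `None` as the missing-value marker, and the sample rows are all assumptions made for the example.

```python
from statistics import mean


def drop_missing(rows):
    """Listwise deletion: keep only rows with no missing (None) values."""
    return [row for row in rows if None not in row]


def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value; otherwise
    statistics.mean raises StatisticsError.
    """
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


# Hypothetical sample: two numeric columns, two rows with a gap each.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [4.0, 40.0],
]
complete = drop_missing(rows)
imputed = impute_column_means(rows)
print(complete)
print(imputed)
```

Deletion is simple but discards half of this sample. Mean imputation keeps every row but shrinks the variance of the filled columns, which can bias later analysis.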
Tags: Tech Talks, computer science, data science, efficiency, first year friendly, statistics
Jupyter Notebooks
Jupyter Notebooks are a widely used tool among data scientists and engineers. Kernels are available for a wide variety of programming languages, most notably Python.
Tags: Tech Talks, data science, first year friendly, statistics, tutorial