# Data science

One completed talk and three topic suggestions are tagged with **data science**.

## Related Tags

- statistics
- Tech Talks
- tutorial
- parallel computing
- distributed system
- first year friendly
- efficiency
- algorithm

## Completed Talks

### Metric embeddings and dimensionality reduction

Delivered by Frieda Rong on Friday, March 31, 2017

In this talk, we consider embeddings that preserve the pairwise distances of a set of points. It is often useful to find mappings from a high-dimensional space to a lower-dimensional space that preserve the geometry of the points. One source of applications is the streaming of large amounts of data, for which storage is costly or impractical. The study of such embeddings has also inspired developments in the design of approximation algorithms and in compressed sensing.

At the crux of the talk is the remarkable Johnson-Lindenstrauss lemma. This fundamental result shows that, for Euclidean spaces, it is possible to significantly reduce the dimension of a set of points while approximately preserving the pairwise distances. An elementary proof will be given, followed by later speed improvements based on sparse projections and on an interesting use of the Fourier transform. We will also discuss applications of the lemma to the fields mentioned above.
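As a rough illustration of the kind of dimensionality reduction the lemma describes, the sketch below projects points through a random Gaussian matrix and measures how much the pairwise distances change. It is not taken from the talk: the function names `random_projection` and `max_distortion`, the dimensions, and the seeds are assumptions chosen for illustration, and a dense Gaussian matrix is only one standard construction. The sparse and Fourier-based variants are not shown.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Map each point to target_dim coordinates using a random Gaussian matrix.

    Entries are drawn from N(0, 1) and scaled by 1/sqrt(target_dim) so that
    squared lengths are preserved in expectation.
    """
    rng = random.Random(seed)
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
        for _ in range(target_dim)
    ]
    return [
        [sum(m * x for m, x in zip(row, point)) for row in matrix]
        for point in points
    ]


def max_distortion(original, projected):
    """Largest relative change in distance over all pairs of points."""
    worst = 0.0
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            before = math.dist(original[i], original[j])
            after = math.dist(projected[i], projected[j])
            worst = max(worst, abs(after / before - 1.0))
    return worst


# Hypothetical data: 20 random points in 1000 dimensions, reduced to 200.
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(20)]
projected = random_projection(points, target_dim=200)
distortion = max_distortion(points, projected)
print(f"worst relative distance change: {distortion:.3f}")
```

The lemma guarantees that a target dimension on the order of log(n)/ε² suffices for n points and relative error ε, independent of the original dimension. The values 1000 and 200 here are arbitrary and only meant to make the effect visible.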

## Talk Suggestions

### Complex Event Processing Systems

The ever-increasing amount of information that needs to be processed has led to the development of complex event processing systems such as Apache Storm and Twitter Heron. These systems distribute a workload over many machines in a cluster and offer both efficiency and fault tolerance.
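One way to picture how a workload can be spread over several workers is key-based partitioning, where every event with the same key is routed to the same worker. The sketch below simulates this in a single process as a toy word count. It does not use the Apache Storm or Twitter Heron APIs, and `assign_worker`, `process_stream`, and the sample events are hypothetical names and data chosen for illustration.

```python
import zlib
from collections import Counter


def assign_worker(key, num_workers):
    """Pick a worker index from a stable hash of the key."""
    # crc32 is used instead of hash() because hash() of a str
    # varies between Python processes.
    return zlib.crc32(key.encode("utf-8")) % num_workers


def process_stream(events, num_workers):
    """Route each event to one worker and let that worker count it."""
    workers = [Counter() for _ in range(num_workers)]
    for word in events:
        workers[assign_worker(word, num_workers)][word] += 1
    return workers


events = ["storm", "heron", "storm", "spark", "heron", "storm"]
workers = process_stream(events, num_workers=3)

# Merging the per-worker counts gives the overall result.
total = sum(workers, Counter())
print(total)
```

Because all occurrences of a key land on one worker, each worker can keep its counts locally without coordinating with the others. A real system additionally has to handle machine failures and uneven load, which this sketch ignores.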

Possible reference materials for this topic include:

- Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal, S., ... & Taneja, S. (2015, May). Twitter Heron: Stream processing at scale. In *Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data* (pp. 239-250). ACM. doi:10.1145/2723372.2742788


Tags: Tech Talks, computer science, data science, distributed system, first year friendly, parallel computing, statistics

### Dealing with Missing Data

Data are rarely perfect, so robust data science tools need ways to handle missing values. This is not always easy: a balance must be struck between performance and convenience.
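As a small illustration of what handling missing values can look like, the sketch below compares two common strategies on a list in which `None` marks a missing entry: dropping the missing entries, and filling them with the mean of the observed ones. The helper names `drop_missing` and `impute_mean` and the sample readings are assumptions made for this example. Real tools offer many more strategies.

```python
from statistics import mean


def drop_missing(values):
    """Keep only the observed values (changes the length of the data)."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


# Hypothetical sensor readings; None marks a missing measurement.
readings = [4.0, None, 5.0, 3.0, None]

dropped = drop_missing(readings)
imputed = impute_mean(readings)
print(dropped)  # [4.0, 5.0, 3.0]
print(imputed)  # [4.0, 4.0, 5.0, 3.0, 4.0]
```

Dropping is simple but loses the alignment between positions and observations. Mean imputation keeps the length but understates the variability of the data, so neither choice is free.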


Tags: Tech Talks, computer science, data science, efficiency, first year friendly, statistics

### Jupyter Notebooks

Jupyter Notebooks are a widely used tool among data scientists and engineers. They combine code, its output, and explanatory text in a single document, and they support a wide variety of programming languages, most commonly Python.


Tags: Tech Talks, data science, first year friendly, statistics, tutorial