PyData Warsaw 2017
PyData conferences are organized by NumFOCUS, a nonprofit supporting open source scientific computing (they support NumPy, pandas, scikit-learn and Jupyter, among other things).
The Warsaw conference took place at the Copernicus Science Centre. It spanned three days: one workshop day followed by two conference days. I didn’t attend the workshops. Some of them seemed pretty basic, and others were not announced until the last week.
One big topic of the conference was the interpretability of machine learning models. The second keynote, Towards Interpretable and Accountable Models, covered both the technical and social aspects of the topic. It was the first talk that mentioned the lime and eli5 libraries. An interesting takeaway was that these libraries, in addition to their intended purpose, can help with debugging and with choosing models.
Radim Řehůřek’s keynote, Winning together: Bridging the gap between academia and industry, also touched on, among other things, the difference in interpretability standards between industrial and scientific settings. This talk also announced another library, bounter, that NLP/text mining people should find useful: bounter (a portmanteau of bounded counter) is a data structure with functionality similar to a counter, except that the answers it gives are approximate (it builds on probabilistic data structures such as HyperLogLog and Count-min Sketch). This talk also mentioned lime and eli5.
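To make the idea of an approximate, bounded-memory counter concrete, here is a minimal Count-min Sketch in plain Python. It is only a toy illustration of the underlying technique, not bounter's implementation or API; the class name, `width`, `depth` and the choice of hash function are all my own assumptions.

```python
import hashlib


class CountMinSketch:
    """Toy Count-min Sketch: approximate counts in a fixed-size table."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row, derived from a hash salted with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"), salt=bytes([row]), digest_size=8
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum over rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch()
for word in "the cat sat on the mat near the door".split():
    sketch.add(word)

# Never below the true count (3 here); may overestimate on hash collisions.
print(sketch.estimate("the"))
```

Memory use depends only on `width * depth`, not on the number of distinct items, which is the trade-off that makes this kind of counter attractive for large text corpora.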
Another talk in a similar vein was Debugging machine learning. The speaker emphasized interpretability and explainability not only of models, but also of code. He also advocated using tests and other software development best practices in machine learning pipelines. This was the third talk that mentioned the lime and eli5 libraries.
One of the talks I liked the most was How to visualize neural network parameters and activity. Justin Shenk, who is a graduate student working on neural network interpretability, presented live demos of several toolboxes for visualizing the weights of deep learning architectures. The talk showed that interpretability tools are available for complex neural network architectures like convolutional and recurrent nets, and that they can be both fun and useful for learning about these methods and diagnosing their performance.
Word embeddings are often presented under the banner of deep learning, even though they don’t exactly use deep architectures. Exploring word2vec vector space touched on how semantic and lexical structure is encoded in the structure of a linear space. Actually, the title was a misnomer, since the embeddings used were GloVe, a related technique. What really stood out in the talk was that the speaker implemented an app for visualizing this structure, which can be accessed from Piotr Migdał’s blog post. The blog post does not go too deep, though; further information can be found in this paper.
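The "linear structure" mentioned above is usually demonstrated with vector arithmetic on word analogies. The sketch below uses hand-made two-dimensional toy vectors (not real word2vec or GloVe output) purely to show the mechanics; the words, the axes and the `analogy` helper are my own illustrative assumptions.

```python
import math

# Hand-made toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# These are NOT real embeddings; they only illustrate the arithmetic.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman", toy_vectors))  # queen
```

With real embeddings the same arithmetic is applied to vectors with hundreds of dimensions, and the match is approximate rather than exact.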
Another interesting application of neural nets to NLP tasks was covered by Use of vectorized text and siamese recurrent neural networks for Allegro offers clustering. Siamese networks are networks that take two inputs and output their similarity score; both inputs are encoded with the same, shared weights. The Allegro guys used siamese RNNs to find duplicate book offers, based on concatenated word embeddings of the text inputs. Unfortunately, their work on these architectures was based on a small subset of books (a couple of genres), so it’s hard to tell how their results will generalize.
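The weight-sharing idea can be sketched in a few lines of plain Python. Note that this is not the Allegro model: I swap the recurrent encoder for a single untrained dense layer with random weights, and the class name, dimensions and input vectors are all hypothetical. The only point is that both inputs pass through one and the same encoder before being compared.

```python
import math
import random


class SiameseScorer:
    """Toy siamese scorer: one shared encoder, cosine similarity on top."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # A single weight matrix; both branches of the network use it.
        self.weights = [
            [rng.uniform(-1, 1) for _ in range(input_dim)]
            for _ in range(hidden_dim)
        ]

    def encode(self, x):
        # Dense layer with tanh activation, applied identically to any input.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.weights]

    def similarity(self, a, b):
        ea, eb = self.encode(a), self.encode(b)
        dot = sum(p * q for p, q in zip(ea, eb))
        norm = math.sqrt(sum(p * p for p in ea)) * math.sqrt(sum(q * q for q in eb))
        return dot / norm if norm else 0.0


scorer = SiameseScorer(input_dim=3, hidden_dim=4)
offer_a = [0.2, 0.9, -0.4]   # made-up feature vectors standing in for
offer_b = [0.1, -0.7, 0.8]   # embedded offer texts
print(scorer.similarity(offer_a, offer_a))  # identical inputs score about 1.0
print(scorer.similarity(offer_a, offer_b))
```

In a real setup the shared weights would be trained on pairs labeled as duplicates or non-duplicates, so that the similarity score becomes meaningful.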
Image generation using deep learning was the most theoretical talk I attended. The speaker tried to explain the derivation of the training procedures for Variational Autoencoders and Generative Adversarial Networks, which was hard to do in 30 minutes (most likely the slides were prepared for a longer talk). VAEs and GANs are generative methods that are used for state-of-the-art image generation. The speaker showed amusing examples of faces generated by models trained on a dataset of celebrity face photos.
The last interesting talk I’ll cover is Adam Paszke’s PyTorch talk. PyTorch is a Python implementation of Torch: a tensor library with dynamic computation graphs. Dynamic, or define-by-run, software (contrast this with define-and-run, as in Theano/TensorFlow) does not compile computation graphs, but instead executes operations on the fly. This makes writing in PyTorch easier than in TensorFlow, for example: anyone familiar with linear algebra and NumPy can start writing code in PyTorch. It also makes debugging easier, since standard Python debugging tools can be used. Dynamic graphs also make writing recurrent and other complicated neural network architectures easier, with no need to handle anything with `scan` or some similar function.
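To show what define-by-run means in practice, here is a tiny scalar autograd written in plain Python. It is not PyTorch and does not use its API; the `Value` class and the `power` helper are hypothetical names for this sketch. The graph is recorded while ordinary Python code runs, so a regular `for` loop decides its shape.

```python
class Value:
    """Scalar that records the operations applied to it as they execute."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._local_grads = local_grads  # derivative w.r.t. each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically order the recorded graph, then apply the chain rule.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


def power(x, n):
    # A plain Python loop: the graph gets n multiply nodes, chosen at run time.
    result = Value(1.0)
    for _ in range(n):
        result = result * x
    return result


x = Value(2.0)
y = power(x, 3)   # y = x ** 3, built on the fly
y.backward()
print(y.data)     # 8.0
print(x.grad)     # 12.0, i.e. 3 * x ** 2 at x = 2
```

Because every intermediate result is an ordinary Python object, you can inspect it with `print` or a debugger at any step, which is the debugging advantage mentioned above.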
The speaker showcased the impressive growth of the library: it was released in 2017, but it already supports lots of utilities and high-level functions, for example modules for loading classic image datasets (as Keras does).
Other topics covered at the conference included boosting (a talk on XGBoost, LightGBM and CatBoost) and natural language processing (an introductory talk, one on neural translation, and one on Slavic languages). There was also one talk that wasn’t on Python: it was on Julia (this language is also sponsored by NumFOCUS).
Attending PyData Warsaw 2017 was real fun if you’re into Python data science tools. You could talk with the developers of your favourite libraries (and get the stickers too).
In general, the NumFOCUS/PyData people put a lot of work into this and it resulted in a great conference. Many thanks to them!