KAGAL! Post on the wall of Naima Joshua on VKontakte.

Naima Joshua

KAGAL!

Anyone who has analyzed data in any way, or has been curious about how it is done, knows about kaggle.com, a site that hosts competitions, sometimes with serious prize money.

I've been registered there for 5 years now and have checked in on it from time to time. Honestly, I never really had time for it, and I never found the time to dig into the specifics of any one competition's data. But the latest competition struck me as especially relevant: https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard. I followed it closely and formally took part. The data was also only 60 MB, so I could work with it without my computer freezing up.

The task. You get bank customer data: 370 numeric columns per customer plus 1 column of 0s and 1s, which is the one you have to "predict". It indicates whether the customer stayed with the bank or decided to leave for another one. Strictly speaking, what you had to predict was not whether the customer would leave, but the probability that they would. In other words, this is the situation any business faces: it has data that correlates with customer behavior, it can use that data to predict a customer's departure, and it can get ahead of it by offering perks and showing off its customer focus and level of service.
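To make "predict the probability, not the label" concrete, here is a minimal sketch in plain Python. It is not the competition data and not what the participants ran: the 3 synthetic columns, the made-up rule generating the 0/1 target, and the helper names `fit_logistic` and `predict_proba` are all assumptions for illustration, and plain logistic regression stands in for the far stronger models people actually used.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function: maps any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, row):
    """Estimated probability that the target is 1 for one customer row."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))

def fit_logistic(rows, labels, lr=0.5, epochs=300):
    """Fit logistic regression by plain batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            err = predict_proba(weights, bias, row) - y
            grad_b += err
            for j, v in enumerate(row):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Synthetic stand-in for the bank data: 3 numeric columns instead of 370.
# Only the first column influences the 0/1 target (an assumption made
# purely for this illustration).
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(300)]
labels = [1 if rng.random() < sigmoid(2.0 * r[0] - 1.5) else 0 for r in rows]

weights, bias = fit_logistic(rows, labels)
probs = [predict_proba(weights, bias, r) for r in rows]
print("first five predicted probabilities:", [round(p, 3) for p in probs[:5]])
```

The output for each customer is a number between 0 and 1 rather than a hard 0 or 1, which is exactly what lets a business rank customers by risk and decide whom to approach first.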

So what is this post about? It's about how useful Kaggle is. During the competition people posted their solutions (scripts in R and Python), discussed approaches, and tuned constants. These scripts are public, so you can level up in data analysis very quickly: you know the essence of the problem being solved, and you see different approaches to it along with the actual implementations. That is super valuable! For example, if you take part in different competitions from time to time, you end up with a collection of approaches, each tailored to a different kind of data, and with high-quality code on top of that.

Next, the invaluable lesson of the latest competition. During the contest, solutions were scored on 50% of the full data. These days the xgboost algorithm (http://www.slideshare.net/ShangxuanZhang/kaggle-winning-solution-xgboost-algorithm-let-us-learn-from-its-author) regularly wins machine learning competitions, and in the end most players' solutions came down to tuning it. A bit boring, actually, but joy is an unstable substance anyway. Let me remind you that the task was to predict the PROBABILITY of a customer leaving the bank. The final winners were not the people who had been leading the whole time, but a completely different set. I, for one, dropped out of the top 15%. Some people jumped 1000 places at once and landed in the top ten. My point is that if you start fitting the algorithm's parameters to a specific case (that same 50% the solution is tested on), you can end up paying for that specificity, because it does not carry over to the full 100%. Always keep this in mind and don't forget it; I forgot, by the way. You can submit two entries, and ideally one should be a general solution and the other a specific one.
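The leaderboard shake-up can be reproduced with a toy simulation. This is a sketch under loud assumptions, not a model of the real competition: 300 hypothetical parameter settings that are all equally good (true accuracy 0.8), a "public" and a "private" half of 400 rows each, and plain accuracy as the score. The only thing that differs between variants is luck.

```python
import random

rng = random.Random(42)

N_PUBLIC = 400       # rows scored during the competition (the public half)
N_PRIVATE = 400      # rows scored only at the very end (the private half)
N_VARIANTS = 300     # hypothetical parameter settings, all equally good
TRUE_ACCURACY = 0.8  # every variant is right on a row with this probability

def simulate_variant():
    """Return (public_score, private_score) for one equally skilled variant."""
    public_hits = sum(rng.random() < TRUE_ACCURACY for _ in range(N_PUBLIC))
    private_hits = sum(rng.random() < TRUE_ACCURACY for _ in range(N_PRIVATE))
    return public_hits / N_PUBLIC, private_hits / N_PRIVATE

scores = [simulate_variant() for _ in range(N_VARIANTS)]

# "Tuning to the leaderboard": keep the 10 variants that look best on the public half.
top10 = sorted(scores, key=lambda s: s[0], reverse=True)[:10]
mean_public = sum(s[0] for s in top10) / len(top10)
mean_private = sum(s[1] for s in top10) / len(top10)

print(f"top 10 by public score: public={mean_public:.4f}, private={mean_private:.4f}")
```

The variants picked for their public score look better than 0.8 on the public half, but on the private half they fall back toward 0.8, because the edge was noise. That is the same mechanism as tuning parameters against the 50% leaderboard, and it is why keeping one "general" submission next to the heavily tuned one is a sensible hedge.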

It's also interesting that the fight really is over fractions of a percent. The solution I used predicted the probability of a customer leaving with a score of 82.4375%, and the winner's solution scored 82.9072%. A trivial difference, right? But, but, but.
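The post does not name the metric behind these percentages. Assuming it is the area under the ROC curve (AUC) written as a percentage, which is a common way to score predicted probabilities, a minimal pure-Python version looks like this. The function name `roc_auc` and the four-row example are illustrative, not taken from the competition.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: the share of (positive, negative) pairs
    in which the positive example gets the higher score; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_example = roc_auc(labels, scores)
print(auc_example)  # 0.75

# Only the ordering of the scores matters: squaring them keeps the order,
# so the AUC stays the same.
auc_squared = roc_auc(labels, [s * s for s in scores])
print(auc_squared)  # 0.75
```

Under that assumption, a score of 82.4375% would mean that a randomly chosen customer who left gets a higher predicted probability than a randomly chosen customer who stayed about 82.4% of the time.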

A few links:
https://www.kaggle.com/cast42/rossmann-store-sales/xgboost-extra-features/code
https://www.kaggle.com/cast42/santander-customer-satisfaction/xgboost-with-early-stopping/files
https://www.kaggle.com/tks0123456789/santander-customer-satisfaction/data-exploration/notebook
https://www.kaggle.com/cast42/santander-customer-satisfaction/debugging-var3-999999
https://www.kaggle.com/cast42/santander-customer-satisfaction/exploring-features

A thoroughly entertaining image recognition competition is starting right now. Welcome. Drop me a line, why not :) So far I can't picture a possible solution; it's going to be a heuristic blast!
https://www.kaggle.com/c/draper-satellite-image-chronology
