[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160331T0000). Please do the needful.
[00:01:44] (03PS1) 10Krinkle: Revert "Correct HTML code for WMF image" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280608
[00:02:13] (03CR) 10Krinkle: [C: 032] Revert "Correct HTML code for WMF image" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280608 (owner: 10Krinkle)
[00:02:37] (03CR) 10Krinkle: [C: 032] errorpages: Remove X-Wikimedia-Debug header from 404.php response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279571 (owner: 10Krinkle)
[00:02:39] (03CR) 10Krinkle: [C: 032] errorpages: Clean up 404.php code and simplify replacement url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279572 (owner: 10Krinkle)
[00:02:49] (03Merged) 10jenkins-bot: Revert "Correct HTML code for WMF image" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280608 (owner: 10Krinkle)
[00:03:20] (03Merged) 10jenkins-bot: errorpages: Remove X-Wikimedia-Debug header from 404.php response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279571 (owner: 10Krinkle)
[00:03:31] (03Merged) 10jenkins-bot: errorpages: Clean up 404.php code and simplify replacement url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279572 (owner: 10Krinkle)
[00:04:15] !log krinkle@tin Synchronized errorpages/: (no message) (duration: 00m 33s)
[00:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:18] (03CR) 1020after4: [C: 031] deploy: delete id_rsa.pub [puppet] - 10https://gerrit.wikimedia.org/r/280480 (owner: 10Chad)
[00:17:06] 6Operations: torrus broken - https://phabricator.wikimedia.org/T131329#2163917 (10Dzahn) 5Open>3Invalid invalid. it worked again after the deadlock problem procedure.. but .. because i had a local /etc/hosts hack with torrus in it i did not actually talk to the right IP :p
[00:19:34] (03CR) 10Dzahn: [C: 031] "commit f8724e60664a33a37a327434f5c3cb71837f4c20" [puppet] - 10https://gerrit.wikimedia.org/r/280480 (owner: 10Chad)
[00:19:40] (03PS2) 10Dzahn: deploy: delete id_rsa.pub [puppet] - 10https://gerrit.wikimedia.org/r/280480 (owner: 10Chad)
[00:21:27] (03CR) 10Dzahn: [C: 032] deploy: delete id_rsa.pub [puppet] - 10https://gerrit.wikimedia.org/r/280480 (owner: 10Chad)
[00:26:32] (03PS2) 10Dzahn: irc.wikimedia.org - remove Apache [puppet] - 10https://gerrit.wikimedia.org/r/280342 (https://phabricator.wikimedia.org/T130981)
[00:27:10] (03PS3) 10Dzahn: irc.wikimedia.org - remove Apache [puppet] - 10https://gerrit.wikimedia.org/r/280342 (https://phabricator.wikimedia.org/T130981)
[00:27:41] (03CR) 10Dzahn: [C: 032] "per discussion on ticket. also removed the class and template" [puppet] - 10https://gerrit.wikimedia.org/r/280342 (https://phabricator.wikimedia.org/T130981) (owner: 10Dzahn)
[00:27:44] PROBLEM - puppet last run on heze is CRITICAL: CRITICAL: Puppet has 1 failures
[00:30:47] !log argon - removing apache and config
[00:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:42:38] (03CR) 10Krinkle: [C: 04-1] Include request ID in profiling data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279631 (owner: 10Ori.livneh)
[00:43:33] (03CR) 10Ori.livneh: Include request ID in profiling data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279631 (owner: 10Ori.livneh)
[00:51:54] 6Operations, 10Traffic, 7HTTPS, 13Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2163941 (10Dzahn) done. removed Apache and config from argon. removed puppet role, class, template...
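The stale-/etc/hosts incident at [00:17:06] illustrates a general pitfall: with the usual nsswitch.conf order of "files dns", a hosts-file entry silently wins over DNS. A toy resolver sketch of that precedence (all names and IPs here are invented for illustration, not real production addresses):

```python
# Toy resolver mimicking nsswitch "files dns" order: the hosts file wins,
# so a forgotten local hack silently redirects traffic. Data is invented.
def resolve(name, hosts_file, dns):
    """Return the hosts-file answer if present, else the DNS answer."""
    return hosts_file.get(name) or dns.get(name)

hosts_file = {"torrus.wikimedia.org": "10.0.0.99"}       # stale local hack
dns = {"torrus.wikimedia.org": "203.0.113.7"}            # placeholder "real" answer

# The hack's IP is returned, whatever DNS says:
print(resolve("torrus.wikimedia.org", hosts_file, dns))  # → 10.0.0.99
```

On a real host, comparing `getent hosts <name>` (which honors /etc/hosts) against `dig +short <name>` (which queries DNS directly) exposes this kind of shadowing.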
[00:52:15] 6Operations, 10Traffic, 7HTTPS: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2163942 (10Dzahn)
[00:53:44] 6Operations, 10Traffic, 7HTTPS: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2152741 (10Dzahn) 5Open>3Resolved
[00:54:05] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[01:01:42] 6Operations, 6Discovery, 7Elasticsearch: Icinga should alert on free disk space < 15% on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329#2163944 (10Dzahn) on the elastic hosts, the local NRPE command is adjusted: ``` root@elastic1001:/etc/nagios/nrpe.d# cat check_disk_space.cfg # File gener...
[01:02:01] 6Operations, 6Discovery, 7Elasticsearch: Icinga should alert on free disk space < 15% on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329#2163945 (10Dzahn) 5Open>3Resolved
[01:02:46] 6Operations: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2163946 (10Dzahn) a:5Dzahn>3None
[01:02:55] 6Operations: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#2163947 (10Dzahn) p:5Normal>3High
[01:03:03] 6Operations, 13Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2163948 (10Dzahn) p:5Normal>3High
[01:08:12] 6Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2163950 (10Dzahn) p:5High>3Normal
[01:08:59] 6Operations, 10Traffic, 10Wiki-Loves-Monuments-General, 7HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2163951 (10Dzahn) p:5Triage>3High
[01:13:35] 6Operations, 10ops-codfw, 6DC-Ops, 13Patch-For-Review: setup new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2163965 (10Dzahn)
[01:13:54] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2163968 (10Dzahn)
[01:13:56] 6Operations, 10ops-codfw, 6DC-Ops, 13Patch-For-Review: setup new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2120304 (10Dzahn) 5Open>3Resolved a) wasat is up and running with the mw and mariadb maintenance roles b) mw2090 is reinstalled as regular appserver - add to site.pp, add...
[01:14:27] robh: just got back, thanks for the ticket + setting everything up
[01:14:37] 6Operations, 10ops-codfw, 6DC-Ops: setup new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2163969 (10Dzahn)
[01:15:03] 6Operations, 10ops-codfw, 6DC-Ops: setup new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2120304 (10Dzahn)
[01:19:24] 6Operations, 13Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#2163978 (10Dzahn) p:5Normal>3Low
[01:27:20] 6Operations: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713#2163988 (10Dzahn)
[01:27:22] 6Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2163987 (10Dzahn) 5Open>3stalled
[01:29:26] 6Operations, 13Patch-For-Review: move RT off of magnesium - https://phabricator.wikimedia.org/T119112#2163989 (10Dzahn) a:5Dzahn>3None
[02:12:21] (03PS10) 10Yuvipanda: tools: Add class that helps build kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/279648 (https://phabricator.wikimedia.org/T129311)
[02:13:26] (03PS11) 10Yuvipanda: tools: Add class that helps build kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/279648 (https://phabricator.wikimedia.org/T129311)
[02:19:20] 6Operations, 10Continuous-Integration-Infrastructure, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2164007 (10Krinkle) @mobrovac That might work. Nvm is most known for managing multiple node versions, but it's also a useful tool for easily installing standard Node tarballs (which i...
[02:19:55] (03CR) 10Yuvipanda: [C: 032] tools: Add class that helps build kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/279648 (https://phabricator.wikimedia.org/T129311) (owner: 10Yuvipanda)
[02:27:42] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.18) (duration: 11m 52s)
[02:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:48:23] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 10m 26s)
[02:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:03:31] is there a way to run fab initialize_server in a self hosted puppet master instance, without copying my ssh keys?
[03:12:30] (03PS1) 10BBlack: Log last (not first) entry for resp/req headers and resp status [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280612
[04:00:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 28846 seconds ago, expected 28800
[04:05:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29145 seconds ago, expected 28800
[04:10:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29445 seconds ago, expected 28800
[04:15:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 29745 seconds ago, expected 28800
[04:20:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30045 seconds ago, expected 28800
[04:25:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30345 seconds ago, expected 28800
[04:30:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30645 seconds ago, expected 28800
[04:35:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 28973 seconds ago, expected 28800
[04:35:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 30945 seconds ago, expected 28800
[04:40:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 29273 seconds ago, expected 28800
[04:40:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31245 seconds ago, expected 28800
[04:45:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 29573 seconds ago, expected 28800
[04:45:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31545 seconds ago, expected 28800
[04:50:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 29873 seconds ago, expected 28800
[04:50:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 31845 seconds ago, expected 28800
[04:55:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 30173 seconds ago, expected 28800
[04:55:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32145 seconds ago, expected 28800
[05:00:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 30473 seconds ago, expected 28800
[05:00:15] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32445 seconds ago, expected 28800
[05:05:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 30773 seconds ago, expected 28800
[05:05:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 32745 seconds ago, expected 28800
[05:10:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 31073 seconds ago, expected 28800
[05:10:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33046 seconds ago, expected 28800
[05:10:20] !log krinkle@tin Synchronized php-1.27.0-wmf.18/extensions/MobileFrontend/: Iaa5ed38c712b19e (duration: 00m 42s)
[05:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:13:51] !log krinkle@tin Synchronized php-1.27.0-wmf.19/extensions/MobileFrontend/: Iaa5ed38c712b19e (duration: 00m 31s)
[05:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:15:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 31373 seconds ago, expected 28800
[05:15:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33346 seconds ago, expected 28800
[05:20:14] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 31673 seconds ago, expected 28800
[05:20:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33645 seconds ago, expected 28800
[05:24:09] ACKNOWLEDGEMENT - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet last ran 33645 seconds ago, expected 28800 daniel_zahn https://phabricator.wikimedia.org/T131338
[05:24:44] ACKNOWLEDGEMENT - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet last ran 31673 seconds ago, expected 28800 daniel_zahn https://phabricator.wikimedia.org/T131338
[05:29:15] (03PS1) 10Krinkle: speed-tests: Update mobile-lazyimage.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280615
[05:29:24] (03CR) 10Krinkle: [C: 032] speed-tests: Update mobile-lazyimage.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280615 (owner: 10Krinkle)
[05:29:52] (03Merged) 10jenkins-bot: speed-tests: Update mobile-lazyimage.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280615 (owner: 10Krinkle)
[05:31:24] !log krinkle@tin Synchronized docroot/wikipedia.org/speed-tests/: (no message) (duration: 00m 33s)
[05:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:11:38] !log mwscript deleteEqualMessages.php --wiki zh_min_nanwiki (T45917)
[06:11:39] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
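The long run of check_puppetrun CRITICALs above all reduce to one comparison: seconds since the last puppet run versus the expected interval (28800 s, i.e. 8 hours, in these alerts). A rough sketch of that check logic, not the actual plugin source:

```python
# Sketch of a puppet-freshness check: CRITICAL once the last run is
# older than the expected interval (28800 s in the alerts above).
def check_puppetrun(last_ran_seconds_ago: int, expected: int = 28800) -> str:
    if last_ran_seconds_ago > expected:
        return (f"CRITICAL: Puppet last ran {last_ran_seconds_ago} "
                f"seconds ago, expected {expected}")
    return "OK"

print(check_puppetrun(28846))  # matches the first barium alert at 04:00
print(check_puppetrun(1200))   # a fresh run passes
```

Each 5-minute poll re-evaluates the same comparison, which is why the alert repeats with a steadily growing "last ran" number until puppet actually runs (or the alert is acknowledged, as happens at 05:24).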
[06:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:14:46] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: puppet fail
[06:29:34] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: puppet fail
[06:30:16] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:25] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:45] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:57] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: puppet fail
[06:31:04] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:54] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:55] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:04] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:54] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: puppet fail
[06:34:02] mmmm there is a 500 apache error page logged in /var/log/syslog for kafka1002 (from strontium)..
[06:36:05] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0]
[06:38:05] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:41:11] --^ re-run puppet manually, is it the "usual" puppet glitch?
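The "Kafka Broker Replica Max Lag" alerts above fire on the fraction of recent datapoints that exceed a threshold ("53.33% of data above the critical threshold [5000000.0]"). A toy version of that evaluation, using invented lag samples; this is the assumed semantics of the graphite-based check, not its real source:

```python
# Toy "% of data above threshold" evaluation, as used by the
# Kafka replica-lag alert above. Lag samples are invented.
def percent_above(samples, threshold):
    """Percentage of datapoints strictly above the threshold."""
    over = sum(1 for s in samples if s > threshold)
    return 100.0 * over / len(samples)

lag = [6_200_000, 5_900_000, 4_100_000, 7_000_000, 3_000_000, 5_500_000]
pct = percent_above(lag, 5_000_000)
state = "CRITICAL" if pct > 50 else "OK"
print(f"{state}: {pct:.2f}% of data above the critical threshold [5000000.0]")
```

Evaluating a window of samples rather than a single reading is what lets the check ride out short lag spikes; the matching RECOVERY at 06:46 reports "Less than 50.00% above the threshold".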
[06:41:15] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:36] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:56:36] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:56:46] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:34] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:45] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:16] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:25] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:44] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:15] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:16] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:44] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 35292709-629934565 for key PRIMARY on query. Default database: enwiki. Query: INSERT /* Revision::insertOn Bearcat */ INTO revision (rev_id,rev_page,rev_text_id,rev_comment,rev_minor_edit,rev_user,rev_user_text,rev_timestamp,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_model,r
[07:13:06] (03CR) 10Giuseppe Lavagetto: Use ProductionServices for the jobqueue configuration (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279350 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto)
[07:14:05] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:14:45] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:16:26] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[07:23:34] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:25:03] (03PS2) 10Muehlenhoff: Enable base::firewall on neodymium [puppet] - 10https://gerrit.wikimedia.org/r/280213
[07:28:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on neodymium [puppet] - 10https://gerrit.wikimedia.org/r/280213 (owner: 10Muehlenhoff)
[07:28:46] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 78407281 for key PRIMARY on query. Default database: enwiki. Query: INSERT /* WikiPage::doDeleteArticleReal Closedmouth */ INTO archive (ar_namespace,ar_title,ar_comment,ar_user,ar_user_text,ar_timestamp,ar_minor_edit,ar_rev_id,ar_parent_id,ar_text_id,ar_text,ar_flags,ar_len,ar_page_id,
[07:29:04] O_O
[07:29:36] what did i do?
[07:33:17] 6Operations, 13Patch-For-Review: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2164182 (10MoritzMuehlenhoff) And failed again: ``` Merge these changes? (yes/no)? yes Merging 4fa5a8428500739447db13d4f2db3da4f5f900c1... git merge --ff-only 4fa5a8428500739447db13d4f...
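The dbstore1002 replication breakage above is MySQL/MariaDB error 1062: the replica replays an INSERT whose primary key already exists locally, so the SQL thread stops. The failure mode is easy to reproduce in miniature; sqlite3 is used here as a stand-in for MariaDB, and the rev_id value is just the one from the alert:

```python
# Minimal reproduction of the "Duplicate entry ... for key PRIMARY"
# class of error (MySQL errno 1062); sqlite3 stands in for MariaDB.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY)")
con.execute("INSERT INTO revision (rev_id) VALUES (35292709)")
try:
    # Replaying the same insert, as a desynced replica would:
    con.execute("INSERT INTO revision (rev_id) VALUES (35292709)")
except sqlite3.IntegrityError as exc:
    print("duplicate entry for key PRIMARY:", exc)
```

On a real replica the interesting question is why the row is already there (out-of-band writes, a bad restore point); blindly skipping the offending statement hides a data divergence rather than fixing it.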
[07:35:26] 6Operations, 13Patch-For-Review: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2164183 (10MoritzMuehlenhoff) p:5Triage>3High
[07:40:46] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:41:25] (03PS1) 10Muehlenhoff: Fix ferm rules for salt master [puppet] - 10https://gerrit.wikimedia.org/r/280621
[07:41:54] (03PS2) 10Muehlenhoff: Fix ferm rules for salt master [puppet] - 10https://gerrit.wikimedia.org/r/280621
[07:42:22] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix ferm rules for salt master [puppet] - 10https://gerrit.wikimedia.org/r/280621 (owner: 10Muehlenhoff)
[07:44:27] (03PS1) 10Elukey: Revert "Bump nf_conntrack_max temporarily to allow proper investigation." [puppet] - 10https://gerrit.wikimedia.org/r/280622 (https://phabricator.wikimedia.org/T131028)
[07:44:50] (03CR) 10jenkins-bot: [V: 04-1] Revert "Bump nf_conntrack_max temporarily to allow proper investigation." [puppet] - 10https://gerrit.wikimedia.org/r/280622 (https://phabricator.wikimedia.org/T131028) (owner: 10Elukey)
[07:45:15] (03PS2) 10Elukey: Revert "Bump nf_conntrack_max temporarily to allow proper investigation." [puppet] - 10https://gerrit.wikimedia.org/r/280622 (https://phabricator.wikimedia.org/T131028)
[07:45:38] (03CR) 10jenkins-bot: [V: 04-1] Revert "Bump nf_conntrack_max temporarily to allow proper investigation." [puppet] - 10https://gerrit.wikimedia.org/r/280622 (https://phabricator.wikimedia.org/T131028) (owner: 10Elukey)
[07:47:16] (03PS2) 10Giuseppe Lavagetto: Use ProductionServices for the jobqueue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279350 (https://phabricator.wikimedia.org/T114273)
[07:47:17] (03PS2) 10Giuseppe Lavagetto: Use local resources in codfw for parsoid, url-downloader and mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279355 (https://phabricator.wikimedia.org/T114273)
[07:48:12] (03Abandoned) 10Elukey: Revert "Bump nf_conntrack_max temporarily to allow proper investigation." [puppet] - 10https://gerrit.wikimedia.org/r/280622 (https://phabricator.wikimedia.org/T131028) (owner: 10Elukey)
[07:51:12] (03PS1) 10Elukey: Restore nf_conntrack_max setting for the analytics kafka brokers. [puppet] - 10https://gerrit.wikimedia.org/r/280623 (https://phabricator.wikimedia.org/T131028)
[07:51:40] (03CR) 10Muehlenhoff: [C: 031] Restore nf_conntrack_max setting for the analytics kafka brokers. [puppet] - 10https://gerrit.wikimedia.org/r/280623 (https://phabricator.wikimedia.org/T131028) (owner: 10Elukey)
[07:53:10] (03CR) 10Elukey: [C: 032] Restore nf_conntrack_max setting for the analytics kafka brokers. [puppet] - 10https://gerrit.wikimedia.org/r/280623 (https://phabricator.wikimedia.org/T131028) (owner: 10Elukey)
[07:55:01] 6Operations, 6Analytics-Kanban, 13Patch-For-Review: nf_conntrack warnings for kafka hosts - https://phabricator.wikimedia.org/T131028#2164194 (10elukey) a:3elukey
[07:57:24] (03CR) 10Giuseppe Lavagetto: jobrunner: fix the redis servers list in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279340 (owner: 10Giuseppe Lavagetto)
[08:01:58] <_joe_> AaronSchulz: if you're up (given your current TZ it's a possibility) I updated https://gerrit.wikimedia.org/r/279350
[08:08:13] (03PS2) 10Giuseppe Lavagetto: jobrunner: fix the redis servers list in codfw [puppet] - 10https://gerrit.wikimedia.org/r/279340
[08:09:06] (03PS1) 10Catrope: Remove MoodBar and related settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340)
[08:10:03] (03CR) 10Elukey: [C: 031] jobrunner: fix the redis servers list in codfw [puppet] - 10https://gerrit.wikimedia.org/r/279340 (owner: 10Giuseppe Lavagetto)
[08:10:09] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: fix the redis servers list in codfw [puppet] - 10https://gerrit.wikimedia.org/r/279340 (owner: 10Giuseppe Lavagetto)
[08:10:21] <_joe_> that was fast elukey :)
[08:11:10] _joe_ I was reading my emails :P
[08:12:45] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:14:34] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[08:15:15] 6Operations, 6Analytics-Kanban, 13Patch-For-Review: nf_conntrack warnings for kafka hosts - https://phabricator.wikimedia.org/T131028#2164245 (10elukey) 5Open>3Resolved As stated in the commit message: """ We added a graphite metric (using a diamond script) to track nf_conntrack_count and observed the k...
[08:15:44] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:17:04] apergos: around?
[08:17:13] kart_: yes, what's up?
[08:17:32] apergos: We need to start work on, https://phabricator.wikimedia.org/T127793 so need your help.
[08:17:59] ah ha
[08:18:08] I won't be available til the 2nd to do any real work on it
[08:18:18] apergos: estimated dump size etc given in comment there, let me know how we can start.
[08:19:08] apergos: No problem. If you can point any preparation etc needed by us, that will be great meanwhile.
[08:19:27] these are to be run out of cron I guess, once every so often?
[08:19:37] kart_:
[08:19:43] apergos: cron.
[08:19:55] apergos: frequency need to finalize.
[08:19:58] ok
[08:20:23] we'll want a directory to put them in, under "other" on the dumps host
[08:21:03] is the script a maintenance class script to generate them?
[08:21:26] will it take all the usual arguments for output files, compression streams, etc?
[08:21:37] kart_:
[08:21:40] (03CR) 10Elukey: [C: 031] Log last (not first) entry for resp/req headers and resp status [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280612 (owner: 10BBlack)
[08:22:07] (03CR) 10Jforrester: "What'll happen to the contents of the DB tables? They're already archived I assume, so we can drop them from prod at some point?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: 10Catrope)
[08:27:41] 6Operations, 6Analytics-Kanban, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2164270 (10elukey) Summary of last updates: 1) varnish-kafka code changes in https://gerrit.wikimedia.org/r/#/c/276439/ and https://gerrit.wikim...
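The nf_conntrack work resolved at [08:15:15] boils down to watching the kernel's connection-tracking table fill ratio: nf_conntrack_count against nf_conntrack_max, both exposed under /proc/sys/net/netfilter/. A sketch of that check; the two sample values below are invented stand-ins for what a real host would report:

```python
# Conntrack table utilization check. On a real host, read
# /proc/sys/net/netfilter/nf_conntrack_count and nf_conntrack_max;
# the sample values here are invented.
count, maximum = 196608, 262144
pct = 100 * count / maximum
print(f"conntrack: {count}/{maximum} ({pct:.0f}%)")
if pct > 80:
    print("WARNING: table nearly full; new connections would be dropped")
```

When the table fills, the kernel drops new connections (the "nf_conntrack: table full" syslog warnings the task refers to), which is why the fix was to track the count in graphite and size nf_conntrack_max to the observed load rather than leave a temporary bump in place.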
[08:27:56] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 78407288 for key PRIMARY on query. Default database: enwiki. Query: INSERT /* WikiPage::doDeleteArticleReal Closedmouth */ INTO archive (ar_namespace,ar_title,ar_comment,ar_user,ar_user_text,ar_timestamp,ar_minor_edit,ar_rev_id,ar_parent_id,ar_text_id,ar_text,ar_flags,ar_len,ar_page_id,
[08:29:20] (03CR) 10Jcrespo: "Please do not accept patches that stop using tables without the developer providing the steps to cleanup the database." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: 10Catrope)
[08:30:14] jynus: ^FYI again, let me know if you want me to handle it ;)
[08:30:22] (03PS1) 10Muehlenhoff: Add ferm rules for mediawiki::maintenance [puppet] - 10https://gerrit.wikimedia.org/r/280625
[08:30:24] (03PS1) 10Muehlenhoff: Enable base::firewall on wasat [puppet] - 10https://gerrit.wikimedia.org/r/280626
[08:30:26] (03PS1) 10Muehlenhoff: Enable base::firewall on terbium [puppet] - 10https://gerrit.wikimedia.org/r/280627
[08:30:41] no, volans, it is ok, I will ack so it doesn't report again
[08:30:48] apergos: I'll add info to task.
[08:30:59] kart_: cool
[08:31:02] it should give a couple of extra errors until it is finally fixed
[08:31:11] basically that's all that has to be there, once we have the script then cronifying is easy peasy
[08:31:29] apergos: cool. going to meeting, after that.
[08:31:36] ok, thank you
[08:31:39] (03PS6) 10Elukey: Add correct varnishkafka configuration files for Varnish 4 servers. [puppet] - 10https://gerrit.wikimedia.org/r/280459 (https://phabricator.wikimedia.org/T124278)
[08:33:26] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:34:05] (03PS1) 10Muehlenhoff: Also enable base::firewall on sarin [puppet] - 10https://gerrit.wikimedia.org/r/280628
[08:38:05] (03CR) 10Muehlenhoff: "puppet has been re-enabled on rcs*" [puppet] - 10https://gerrit.wikimedia.org/r/279339 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff)
[08:40:32] (03CR) 10Elukey: [C: 032] "Tested again with cp1043/1044/1052:" [puppet] - 10https://gerrit.wikimedia.org/r/280459 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey)
[08:43:18] why does icinga keep mentioning my username?
[08:49:29] closedmouth: erm?
[08:50:05] "PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 78407288 for key PRIMARY on query. Default database: enwiki. Query: INSERT /* WikiPage::doDeleteArticleReal Closedmouth */ INTO archive (ar_namespace,ar_title,ar_comment" etc
[08:51:16] judging from the backlog jynus has it under control
[08:51:27] the fact that it referenced your edit is just a coincidence
[08:51:38] שלום
[08:51:54] hashar: :)
[08:52:00] I do not know why it is alerting, i have downtimed that
[08:52:00] are you in jerusalem?
[08:52:09] nop
[08:52:14] not much values for me to be there
[08:52:22] huh
[08:52:28] since it is focusing on community related development and really ... I am not a dev -:D
[08:52:31] yes, what do you know
[08:52:35] except mediawiki and wikimedia
[08:52:46] and networking and a few other things :)
[08:52:56] might come in SF instead
[08:53:04] it's a terrible place full of terrible people
[08:53:12] (SF)
[08:53:20] if I ever stop procrastinate and actually prepare material / tutorials / courses CI related for SF folks
[08:53:26] jerusalem, OTOH, is a wonderful place full of terrible people
[08:54:17] <_joe_> ori: I disagree, SF is not a terrible place
[08:54:41] well sure, maybe "place" is too flattering
[08:54:46] <_joe_> and well, terrible people are a dense set within the set of human beings :)
[08:55:02] dense people are a terrible set within the set of human beings, too
[08:56:38] * James_F laughs.
[08:57:56] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:59:50] 6Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2164326 (10fgiunchedi) @Cmjohnson is there space to add the SSDs alongside the existing disks? if so, let's coordinate for a time on IRC today or tomorrow to do this, thanks!
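For the dump job apergos and kart_ discuss above ("once we have the script then cronifying is easy peasy"), the eventual cron entry would look roughly like the sketch below. Every name here is a placeholder: the maintenance script, schedule, user, and output directory under "other" on the dumps host were all still undecided in the conversation.

```
# Hypothetical /etc/cron.d fragment -- script name, schedule, user and
# paths are placeholders, not the real deployment. Note the escaped \%
# signs, which cron would otherwise treat as line separators.
0 4 * * 0  dumpsgen  /usr/local/bin/mwscript dumpSomething.php --wiki=enwiki \
    | gzip > /data/dumps/other/somedumps/dump-$(date +\%Y\%m\%d).gz
```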
[09:00:39] !log Forced WriteBack cache policy mode on db1047 RAID
[09:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:02:43] (03PS1) 10Muehlenhoff: Add salt grains for mediawiki maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/280631
[09:02:45] (03PS1) 10Muehlenhoff: Add mediawiki maintenance hosts to debdeploy server groups [puppet] - 10https://gerrit.wikimedia.org/r/280632
[09:04:13] (03PS1) 10Giuseppe Lavagetto: ganglia_aggregator_config: fixup for I6523610b5, linting [puppet] - 10https://gerrit.wikimedia.org/r/280633
[09:05:07] <_joe_> volans: ^^
[09:06:23] I going to be slow reporting back for the next 15 or so minutes
[09:07:25] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active
[09:07:29] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/280633 (owner: 10Giuseppe Lavagetto)
[09:16:38] CI is back :)
[09:17:38] (03CR) 10ArielGlenn: [C: 031] "Why isn't that already on there? I'm looking at neodymium to see why it has it and sarin doesn't. Anyways, yes please add it." [puppet] - 10https://gerrit.wikimedia.org/r/280628 (owner: 10Muehlenhoff)
[09:23:03] (03CR) 10Florianschmidtwelzow: "what is reading web staging? o.O" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279934 (https://phabricator.wikimedia.org/T113243) (owner: 10Florianschmidtwelzow)
[09:23:35] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[09:25:32] 6Operations, 10ops-codfw: install SSDs in restbase2001-restbase2006 - https://phabricator.wikimedia.org/T127333#2164398 (10fgiunchedi) 5Open>3Resolved thanks @papaul
[09:27:43] 6Operations, 10ops-codfw, 10RESTBase-Cassandra: restbase2004.codfw.wmnet: Failed disk/RAID - https://phabricator.wikimedia.org/T130990#2164406 (10fgiunchedi) restbase2004 has been reimaged yesterday with multiple instances, its first instance is bootstrapping, ETA ~16h
[09:30:44] (03PS2) 10Giuseppe Lavagetto: ganglia_aggregator_config: fixup for I6523610b5, linting [puppet] - 10https://gerrit.wikimedia.org/r/280633
[09:31:47] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia_aggregator_config: fixup for I6523610b5, linting [puppet] - 10https://gerrit.wikimedia.org/r/280633 (owner: 10Giuseppe Lavagetto)
[09:32:25] (03PS1) 10Muehlenhoff: Remove access credentials for S Page [puppet] - 10https://gerrit.wikimedia.org/r/280635
[09:35:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
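The logstash change merged below at [09:54:57] ("Add systemd unit for logstash") replaces legacy init handling with a native unit file. Broadly, such a unit looks like the sketch here; this is a generic illustration with placeholder paths and user, not the contents of the reviewed patch:

```
# Hypothetical logstash.service sketch -- paths, user and options are
# placeholders, not the unit actually deployed by the puppet change.
[Unit]
Description=Logstash log aggregator
After=network.target

[Service]
User=logstash
ExecStart=/usr/share/logstash/bin/logstash --path.settings /etc/logstash
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The SAL entry at [09:50:07] shows the rollout pattern: disable puppet on the affected hosts first, then let the new unit activate host by host rather than fleet-wide at once.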
[09:36:09] (03PS2) 10Muehlenhoff: Add salt grains for mediawiki maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/280631 [09:36:24] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet has 1 failures [09:36:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for mediawiki maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/280631 (owner: 10Muehlenhoff) [09:36:50] (03PS11) 10KartikMistry: Enable non-default MT for some languages [puppet] - 10https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849) [09:39:43] (03CR) 10Ori.livneh: Include request ID in profiling data (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279631 (owner: 10Ori.livneh) [09:43:11] (03PS2) 10Muehlenhoff: Add mediawiki maintenance hosts to debdeploy server groups [puppet] - 10https://gerrit.wikimedia.org/r/280632 [09:43:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add mediawiki maintenance hosts to debdeploy server groups [puppet] - 10https://gerrit.wikimedia.org/r/280632 (owner: 10Muehlenhoff) [09:47:36] (03PS1) 10Mschon: update the DNS record for benefactors.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) [09:50:07] !log disabled puppet on logstash1002/1003 (to activate the new logstash systemd unit in steps) [09:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:36] (03Abandoned) 10Ori.livneh: Include request ID in profiling data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279631 (owner: 10Ori.livneh) [09:51:25] (03PS6) 10Muehlenhoff: Add systemd unit for logstash [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) [09:54:57] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add systemd unit for logstash [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [09:55:06] RECOVERY - BGP status on cr2-ulsfo is 
OK: BGP OK - up: 75, down: 2, shutdown: 0 [10:00:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [10:06:55] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 83.30 ms [10:15:33] (03PS1) 10ArielGlenn: add scap subdir to .gitignore [dumps] - 10https://gerrit.wikimedia.org/r/280640 [10:16:25] (03CR) 10ArielGlenn: [C: 032] add scap subdir to .gitignore [dumps] - 10https://gerrit.wikimedia.org/r/280640 (owner: 10ArielGlenn) [10:16:27] 6Operations, 10ops-codfw, 6DC-Ops: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2164551 (10faidon) It's been 3 weeks now — can we get an update on why this is taking so long? Rumour is that we're waiting for some disk shipment or something, but I don't see any updates or b... [10:16:59] (03PS2) 10Faidon Liambotis: DHCP: don't use esams.wm.org as domain name [puppet] - 10https://gerrit.wikimedia.org/r/280505 (owner: 10Dzahn) [10:17:11] (03CR) 10Faidon Liambotis: [C: 032 V: 032] DHCP: don't use esams.wm.org as domain name [puppet] - 10https://gerrit.wikimedia.org/r/280505 (owner: 10Dzahn) [10:18:05] (03PS1) 10Faidon Liambotis: Rename hooft's mgmt to bast3001 too [dns] - 10https://gerrit.wikimedia.org/r/280641 [10:19:55] PROBLEM - torrus.wikimedia.org UI on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 628 bytes in 0.301 second response time [10:21:42] paravoid: I was about to ping you about --^ [10:22:35] (03Abandoned) 10Giuseppe Lavagetto: Remove decommissioned appservers [dns] - 10https://gerrit.wikimedia.org/r/275756 (https://phabricator.wikimedia.org/T126242) (owner: 10Giuseppe Lavagetto) [10:22:40] I added my username to librenms as suggested on the wiki (netmon1001), and puppet is disabled [10:22:56] (not by me, previously from what it is written in the SAL) [10:23:56] before restarting services I'd like to ask :) [10:24:51] 
!log logstash on logstash100[1-3] is now using systemd [10:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:25:15] (03PS1) 10Mschon: added SPF record to phabricator.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) [10:35:37] !log restarted torrus-common on netmon1001 [10:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:41:51] 6Operations, 7Puppet, 10Salt, 13Patch-For-Review: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#2164586 (10Joe) @ArielGlenn the problem lies solely with wmf-reimage, I'm going to fix it now. [10:42:06] !log stopping torrus-common on netmon1001 to try https://wikitech.wikimedia.org/wiki/Torrus#Deadlock_problem [10:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:44:14] RECOVERY - torrus.wikimedia.org UI on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2505 bytes in 0.386 second response time [10:52:46] thanks torrus [10:54:47] (03CR) 10Nemo bis: added SPF record to phabricator.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [11:03:01] (03PS1) 10Ema: ganglia-varnish.py: get rid of dangerous characters [puppet] - 10https://gerrit.wikimedia.org/r/280646 (https://phabricator.wikimedia.org/T122880) [11:03:09] (03PS1) 10Giuseppe Lavagetto: wmf-reimage: fix salt signing [puppet] - 10https://gerrit.wikimedia.org/r/280647 (https://phabricator.wikimedia.org/T124761) [11:12:53] (03CR) 10Giuseppe Lavagetto: [C: 031] wmf-reimage: fix salt signing [puppet] - 10https://gerrit.wikimedia.org/r/280647 (https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [11:13:04] (03CR) 10Giuseppe Lavagetto: [C: 032] wmf-reimage: fix salt signing [puppet] - 10https://gerrit.wikimedia.org/r/280647 
(https://phabricator.wikimedia.org/T124761) (owner: 10Giuseppe Lavagetto) [11:19:14] (03PS2) 10Ema: ganglia-varnish.py: get rid of dangerous characters [puppet] - 10https://gerrit.wikimedia.org/r/280646 (https://phabricator.wikimedia.org/T122880) [11:19:24] (03CR) 10Ema: [C: 032 V: 032] ganglia-varnish.py: get rid of dangerous characters [puppet] - 10https://gerrit.wikimedia.org/r/280646 (https://phabricator.wikimedia.org/T122880) (owner: 10Ema) [11:27:44] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [11:34:38] (03PS4) 10Volans: Hostgroups: add missing DBs into the mysql cluster [puppet] - 10https://gerrit.wikimedia.org/r/279329 (https://phabricator.wikimedia.org/T130819) [11:40:53] (03PS1) 10Elukey: Remove unused feature in varnishkafka for Varnish 4 [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280651 (https://phabricator.wikimedia.org/T124278) [11:41:37] (03PS1) 10Filippo Giunchedi: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) [11:41:45] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2164625 (10Gehel) I created a minimal gatling project to do some experiments: {F3801415}. Result from a run with pools and HTTPS enabled: {... 
[11:42:10] (03CR) 10Ema: [C: 031] Remove unused feature in varnishkafka for Varnish 4 [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280651 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [11:42:30] (03CR) 10Elukey: [C: 032] Remove unused feature in varnishkafka for Varnish 4 [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280651 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [11:42:43] (03CR) 10Filippo Giunchedi: "there's likely more explanation to do on how it works, but this should provide a starting point" [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [11:42:56] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [11:44:10] (03PS2) 10Muehlenhoff: Also enable base::firewall on sarin [puppet] - 10https://gerrit.wikimedia.org/r/280628 [11:44:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also enable base::firewall on sarin [puppet] - 10https://gerrit.wikimedia.org/r/280628 (owner: 10Muehlenhoff) [11:44:55] (03PS1) 10Elukey: Update the varnishkafka module [puppet] - 10https://gerrit.wikimedia.org/r/280653 (https://phabricator.wikimedia.org/T124278) [11:46:19] (03PS2) 10Elukey: Update the varnishkafka module [puppet] - 10https://gerrit.wikimedia.org/r/280653 (https://phabricator.wikimedia.org/T124278) [11:47:19] (03PS2) 10Muehlenhoff: Remove access credentials for S Page [puppet] - 10https://gerrit.wikimedia.org/r/280635 [11:47:26] (03CR) 10Ema: [C: 031] Update the varnishkafka module [puppet] - 10https://gerrit.wikimedia.org/r/280653 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [11:47:29] (03CR) 10Elukey: [C: 032] Update the varnishkafka module [puppet] - 10https://gerrit.wikimedia.org/r/280653 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [11:50:55] (03CR) 10Muehlenhoff: [C: 032 
V: 032] Remove access credentials for S Page [puppet] - 10https://gerrit.wikimedia.org/r/280635 (owner: 10Muehlenhoff) [11:51:07] (03PS3) 10Muehlenhoff: Remove access credentials for S Page [puppet] - 10https://gerrit.wikimedia.org/r/280635 [11:51:15] (03CR) 10Muehlenhoff: [V: 032] Remove access credentials for S Page [puppet] - 10https://gerrit.wikimedia.org/r/280635 (owner: 10Muehlenhoff) [11:52:45] (03PS2) 10Muehlenhoff: Remove special case handling for labs realm [puppet] - 10https://gerrit.wikimedia.org/r/264264 [11:53:53] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:57:25] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:02:31] !log removed rsync 3.10-2ubuntu0.1~wmf1 from carbon. this backport was only needed when the hosts used a mix of precise and trusty. this is no longer the case, so remove the backport and allow to use stock Ubuntu updates on precise again [12:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:11] (03PS7) 10Giuseppe Lavagetto: Add select mode [software/conftool] - 10https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) [12:13:57] (03CR) 10jenkins-bot: [V: 04-1] Add select mode [software/conftool] - 10https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) (owner: 10Giuseppe Lavagetto) [12:15:02] (03PS8) 10Giuseppe Lavagetto: Add select mode [software/conftool] - 10https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) [12:28:34] PROBLEM - Auth DNS on labs-ns0-former-placeholder.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [12:29:06] I'm looking for a safe way to deploy https://gerrit.wikimedia.org/r/#/c/280204/ (modification to caching headers for portals). [12:29:52] It has been suggested to deploy first on "debug app servers". What does it mean exactly? 
Manually apply the change and test with x-wikimedia-debug headers? [12:30:22] RECOVERY - Auth DNS on labs-ns0-former-placeholder.wikimedia.org is OK: DNS OK: 0.038 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [12:32:52] gehel: I think they refer to mw1017 [12:33:49] elukey: yes, most probably, but am I expected to manually modify the config? Or disable puppet on all mw* servers and apply only on this one? [12:34:41] I've never done it but I believe that you should push the change, and instead of running sync-file (or similar) on tin you just go on mw1017 and sync [12:34:57] and if everything looks good, you proceed [12:35:13] Sorry, I was not specific enough... this is a puppet change, so sync-file does not apply [12:36:18] ahhhh snap sorry! [12:36:19] my bad [12:36:27] gehel: AFAIK the second one (disable with reason, apply on one, test, enable) [12:36:46] on all mw* nodes? [12:36:50] but get confirmation from someone else too for mw* specific stuff [12:37:01] * gehel needs to have a look into salt ... [12:37:19] of course with salt, not ssh-ing into them one by one :D [12:37:23] yep if it is puppet you'd need to disable puppet temporarily using salt grains [12:38:39] (03CR) 10Gehel: "puppet-compiler output for mw1017: https://puppet-compiler.wmflabs.org/2242/mw1017.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/280204 (https://phabricator.wikimedia.org/T126280) (owner: 10Gehel) [12:39:10] life is better with a grain of salt... [12:45:12] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 209 seconds ago with 0 failures [12:45:12] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 165 seconds ago with 0 failures [12:47:24] gehel: that's the spirit (not good for your blood pressure though) [12:47:39] * elukey auto-kicks itself out of the channel [12:48:50] gehel: with a shot of tequila? [12:49:39] just to check that I understood what I read correctly. 
I should from neodymium: sudo salt 'mw*' puppet.disable, apply my change on a test node (mw1017), validate that it is correct, re-enable puppet with "sudo salt 'mw*' puppet.enable". Sounds mostly correct? [12:49:51] (03PS1) 10Ema: ganglia-varnish.py: use slope=both where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/280658 [12:50:12] p858snake: I'm more of a whisky guy, but I'm not sure about salt + whisky ... [12:51:29] (03CR) 10BBlack: [C: 031] ganglia-varnish.py: use slope=both where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/280658 (owner: 10Ema) [12:51:32] gehel: I'd check for -C salt-grain [12:51:58] mw* will also disable puppet on jobrunners, that are not your target right? [12:52:29] (03CR) 10Ema: [C: 032 V: 032] ganglia-varnish.py: use slope=both where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/280658 (owner: 10Ema) [12:53:02] elukey: not my goal (should not hurt if it is only for a short time, but let's try to do it well...) [12:53:42] yeah I wouldn't disable salt on all mw* [12:53:58] err disable puppet [12:54:01] (03PS1) 10Ema: Upgrade cp1044 to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/280660 (https://phabricator.wikimedia.org/T122880) [12:54:21] if you're testing a change on mw1017, surely you only need to disable puppet on mw1017? [12:54:53] (03CR) 10BBlack: [C: 031] Upgrade cp1044 to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/280660 (https://phabricator.wikimedia.org/T122880) (owner: 10Ema) [12:55:39] that was my initial question. Is it expected that I disable puppet on my test host and apply the change manually, or that I disable puppet on the cluster and run puppet agent on my test host. [12:56:28] I'm happy either way, just wanted to know if we have a more or less standard way of running those kinds of tests [12:57:58] I would assume the latter, but it's not really my area. 
you'd be best off asking someone who does this on a regular basis, but most of them aren't in here this early [12:58:29] bblack: thanks! I'll try again later... no emergency to apply this... [12:59:15] gehel: disabling puppet on a subset of mw* and running it only on mw1017 would give you complete confidence about a puppet run, rather than relying on manual changes [12:59:17] I meant the former anyways [12:59:41] I don't see why any host but mw1017 would be involved in testing on mw1017 [12:59:58] unless you're talking about merging before testing I guess [13:00:09] bblack: yes was about merging [13:00:14] I see [13:00:24] so it should be merging incrementally [13:00:26] gehel: the other way is to change the code so that it will apply only to one host [13:00:46] parameter in the class or different block in site.pp I guess [13:01:12] volans: that's getting more complex than I would like it to be for this very simple change... [13:01:36] have you run the puppet compiler to check the changes on affected hosts? [13:02:55] In the end I'm changing a caching header for portals (which are static files in any case). Worst case, I screw up the config and apache does not reload, next worse I screw up my rule and we get slightly more traffic which in the end is only serving static files. [13:03:38] volans: yes, puppet-compiler has been run, change looks good (https://puppet-compiler.wmflabs.org/2242/mw1017.eqiad.wmnet/) [13:05:47] gehel: if you need it, salt -C 'G@cluster:appserver and G@site:eqiad' cmd.run 'bla' [13:06:06] elukey: thanks! [13:06:32] unless any of you has a strong objection, I'm gonna disable puppet on mw1017, apply my change manually, test and merge if all looks good... [13:08:24] if puppet-compiler looks good and the change is trivial (like it seems to be), it should be fine. My objection was more on the fact that it is nice to know what puppet does, but in this case we have the compiler and it doesn't seem a complicated and convoluted change. 
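The disable/test/enable cycle the channel converges on above can be sketched as follows. This is a hedged sketch, not a documented procedure: it assumes the commands are run from the salt master (neodymium, per the discussion), that Salt's puppet execution module provides `puppet.disable`/`puppet.run`/`puppet.enable` as quoted in the log, and `uptime` stands in for whatever check you actually want to run.

```shell
# Disable the puppet agent only on the canary host, not on all of mw*
# (a bare mw* glob would also catch jobrunners, as noted in the discussion):
sudo salt 'mw1017.eqiad.wmnet' puppet.disable

# After merging the change, trigger a single agent run on the canary
# and inspect the result before touching anything else:
sudo salt 'mw1017.eqiad.wmnet' puppet.run

# Compound (-C) matching on grains scopes a command more precisely than a
# hostname glob, e.g. only eqiad appservers, as elukey suggests above:
sudo salt -C 'G@cluster:appserver and G@site:eqiad' cmd.run 'uptime'

# Once the canary looks good, re-enable the agent:
sudo salt 'mw1017.eqiad.wmnet' puppet.enable
```

The same commands with a broader target (disabling puppet on the rest of the cluster first, then running the agent only on mw1017) would match the "disable on the cluster, run on one host" variant also discussed above.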
[13:08:35] plus don't pay attention to me :P [13:09:00] elukey: I disagree... I should pay attention to you... [13:09:10] :P [13:09:21] coffee before change, brb [13:09:38] !log depooling cp1044 for varnish 4 upgrade (T122880) [13:09:38] T122880: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880 [13:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:33] (03CR) 10Ema: [C: 032 V: 032] Upgrade cp1044 to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/280660 (https://phabricator.wikimedia.org/T122880) (owner: 10Ema) [13:17:00] another newbie question, should I depool mw1017 before restarting apache? Or is an apache graceful sufficient? [13:17:01] !log installing rsync security updates [13:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:28] mobrovac: on another subject, your change for Puppet SWAT last Tuesday has disappeared and has not re-appeared on today's SWAT. Has it already been deployed? [13:27:14] !log repooling cp1044, upgraded to varnish 4 (T122880) [13:27:14] T122880: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880 [13:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:03] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Puppet last ran 14 hours ago [13:39:58] 6Operations, 10Traffic, 13Patch-For-Review: Create separate packages for required vmods - https://phabricator.wikimedia.org/T124281#2164742 (10ema) 5Open>3Resolved [13:40:00] 6Operations, 10Traffic, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2164744 (10ema) [13:40:12] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: Puppet has 1 failures [13:43:00] stupid puppet. 
[13:44:20] !log forced puppet agent on netmon1001 [13:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:45:12] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:45:13] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:45:19] (03CR) 10Alex Monk: update the DNS record for benefactors.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: 10Mschon) [13:45:42] 6Operations, 10Traffic, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2164749 (10ema) [13:48:46] (03PS7) 10Hashar: contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [13:48:54] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 63.33% of data above the critical threshold [5000000.0] [13:49:09] (03CR) 10jenkins-bot: [V: 04-1] contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [13:49:48] !log netmon1001 - re-enabled puppet (was for torrus issue earlier) [13:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:17] (03CR) 10Hashar: "Made contint::php to require contint::slave_scripts so the git clone for integration/jenkins is done before attempting to point alternativ" [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [13:50:33] 6Operations, 10Traffic, 7Varnish: Port remaining scripts depending on varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T131353#2164751 (10ema) [13:50:42] 6Operations, 10Traffic, 7Varnish: Port 
remaining scripts depending on varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T131353#2164765 (10ema) p:5Triage>3Normal [13:50:52] (03PS8) 10Hashar: contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [13:51:14] 6Operations, 10Traffic, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2164767 (10ema) [13:51:16] 6Operations, 10Traffic, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2164768 (10ema) [13:51:18] 6Operations, 10Traffic, 13Patch-For-Review: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788#2164766 (10ema) 5Open>3Resolved [13:53:20] 6Operations, 10Traffic, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2164772 (10ema) [13:53:22] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188#2164773 (10ema) [13:54:27] (03CR) 10Hashar: "Rebased and cherry picked again on integration puppet master" [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [13:55:04] (03CR) 10Volans: "All changes looks good, thanks joe to fix the problem for uranium compilation, new run of compiler here:" [puppet] - 10https://gerrit.wikimedia.org/r/279329 (https://phabricator.wikimedia.org/T130819) (owner: 10Volans) [13:55:13] (03PS5) 10Volans: Hostgroups: add missing DBs into the mysql cluster [puppet] - 10https://gerrit.wikimedia.org/r/279329 (https://phabricator.wikimedia.org/T130819) [13:55:53] !log hooft - reboot to pxe, one more time [13:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:52] (03CR) 10Volans: [C: 032] Hostgroups: add 
missing DBs into the mysql cluster [puppet] - 10https://gerrit.wikimedia.org/r/279329 (https://phabricator.wikimedia.org/T130819) (owner: 10Volans) [13:58:23] (03CR) 10Ottomata: Run eventlogging services out of deployed eventlogging source path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) (owner: 10Ottomata) [14:01:29] !log manually running puppet on es2018 to double verify merged changes [14:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:34] great work, volans! [14:03:06] thank you! [14:03:37] just doing my part :) [14:05:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [14:06:26] ^^frack, should be unrelated to my changes, but I cannot ssh to check [14:08:13] volans: boron is fine, i'm not sure why it paged [14:08:25] I just checked and puppet runs clean [14:08:50] thanks Jeff_Green [14:09:03] didn't page here btw, as in, no sms [14:09:08] ya [14:10:12] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:10:28] i am working on a staged deployment of a significant puppet change, so don't be too surprised to see things paging here and there, I'm using icinga downtime to keep things quiet but there have been surprises (like boron, just now) [14:10:31] (03CR) 10Elukey: [C: 031] No-op refactor of eventlogging module [puppet] - 10https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) (owner: 10Ottomata) [14:11:27] heya moritzm, yt? 
i do have a deb packaging q for you [14:11:45] frack: also running package updates...that's probably more the issue with spurious alerts [14:11:53] Jeff_Green: ack, thanks [14:12:59] (03CR) 10Yuvipanda: [C: 031] Move dynamicproxy ferm rules into role::labs::novaproxy and role::labs::tools::proxy [puppet] - 10https://gerrit.wikimedia.org/r/274962 (owner: 10Muehlenhoff) [14:15:18] sure [14:25:39] ah sorry [14:25:42] moritzm: ok so [14:25:47] python-tornado is in jessie backports [14:25:50] i want it in trusty [14:25:52] what should I do? :) [14:26:30] just dl all the stuff and reprepro include it in trusty-wikimedia apt manually? [14:26:34] or should I rebuild it? [14:28:12] 6Operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#2164805 (10MoritzMuehlenhoff) [14:28:14] 6Operations, 10Wikimedia-Logstash: Systemd unit for logstash - https://phabricator.wikimedia.org/T127677#2164803 (10MoritzMuehlenhoff) 5Open>3Resolved This is running on logstash100[1-3] for a few hours now. I've also sent a pull request to github.com/elastic/logstash (seems they have a CLA, though). [14:29:00] is it pure python or does it build c bits? [14:29:34] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:29:47] the cleanest way is to download the source package from jessie-backports and just rebuild it on copper [14:30:12] if you just import the debs from jessie-backports it might have differences in the dependencies [14:30:23] yeah, it might have c bits, i think it uses epoll [14:30:27] moritzm: long term I am hoping we can have a devpi / wheelhouse solution :) [14:30:28] ok [14:30:45] then I'd certainly rather rebuild it [14:31:43] ok [14:32:05] moritzm: should I create a gerrit repo to track our rebuild? or if i am just rebuilding and not changing source, is just modifying changelog manually fine? 
[14:33:04] unless you're planning extensive longterm changes, that's not really needed [14:33:13] we have plenty of simple rebuilds [14:33:41] maybe just add a +wmf1 changelog entry so that we can find out later on why it was rebuilt [14:34:01] because some local builds will need to be forward-ported to a later distro release and some won't [14:35:00] yeah i'm adding an entry with -trusty1 on the current version, packaging for trusty-wikimedia, and adding a comment [14:35:41] _joe_: possible to review https://gerrit.wikimedia.org/r/#/c/277463/ again? little bit improvement, reading from cxserver config is still WIP. [14:36:11] moritzm: i'm a little handicapped by git-buildpackage, especially so on copper where i'm used to all the cowbuilder stuff just working [14:36:31] should I just set up a local git and gbp.conf so it all works as I'm used to, or is it easy to do it manually? [14:36:44] perfect then. there's only a few external debs which are managed in git, those which we maintain rather long term (e.g. openssl 1.0.2/, nginx, linux) [14:37:36] it's rather straightforward: [14:37:53] download the source package on copper (via proxy) [14:38:07] dpkg-source foo.dsc [14:38:12] eh, dpkg-source -x foo.dsc [14:38:20] make local changes and amend changelog [14:38:31] moritzm, apt-get source python-tornado got me the source [14:38:34] and allowed me to get that far [14:38:47] DIST=trusty-wikimedia pdebuild [14:38:51] OO [14:38:56] COL [14:38:57] COOL [14:39:34] and then fetch the build result from /var/cache/pbuilder/trusty-amd64 [14:39:45] and add it on carbon [14:40:01] moritzm: /results/..? [14:40:11] ah, yes [14:40:40] so /var/cache/pbuilder/result/trusty-amd64 [14:41:35] the path from copper to carbon is still a little icky, in the session at the TechOps offsite we were talking about adding a real upload queue, but that hasn't been set up yet [14:42:07] i think its wooorrkiiing [14:42:39] awesome! 
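The rebuild recipe moritzm walks through above, collected into one sequence. A sketch under stated assumptions: it uses the python-tornado example from the discussion, the `dch --local` invocation is one common way to produce a `+wmf1`-style version suffix rather than the exact command used here, and the pbuilder result path is the one quoted in the log.

```shell
# On the build host, fetch and unpack the source package from jessie-backports
# (apt-get source worked in this case; the manual route is to download the
# .dsc and run: dpkg-source -x foo.dsc):
apt-get source python-tornado
cd python-tornado-*/

# Amend debian/changelog so the local rebuild is identifiable later;
# --local appends a suffix and bumps it, yielding a version like ...+wmf1:
dch --local +wmf "Rebuild of the jessie-backports package for trusty-wikimedia"

# Build inside a trusty chroot via pbuilder:
DIST=trusty-wikimedia pdebuild

# Build results land under the pbuilder result directory, from where the
# debs are imported (reprepro include) on the apt repository host (carbon):
ls /var/cache/pbuilder/result/trusty-amd64/
```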
[14:43:58] yeehaw, thanks moritzm http://apt.wikimedia.org/wikimedia/pool/main/p/python-tornado/ [14:44:00] so easy [14:44:08] vw :-) [14:48:02] (03CR) 10Elukey: [C: 031] Run eventlogging services out of deployed eventlogging source path [puppet] - 10https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) (owner: 10Ottomata) [14:48:21] (03PS12) 10Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - 10https://gerrit.wikimedia.org/r/280403 [14:48:46] (03PS1) 10Dzahn: Revert "DHCP: set next-server for public-esams subnet to carbon" [puppet] - 10https://gerrit.wikimedia.org/r/280670 [14:49:01] 6Operations, 10ops-codfw: rack five new spare pool systems - https://phabricator.wikimedia.org/T130941#2164846 (10Papaul) p:5Triage>3Normal [14:49:22] (03PS2) 10Dzahn: Revert "DHCP: set next-server for public-esams subnet to carbon" [puppet] - 10https://gerrit.wikimedia.org/r/280670 [14:50:16] (03CR) 10Dzahn: [C: 032] Revert "DHCP: set next-server for public-esams subnet to carbon" [puppet] - 10https://gerrit.wikimedia.org/r/280670 (owner: 10Dzahn) [14:58:15] (03PS1) 10ArielGlenn: snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 [14:59:24] (03CR) 10jenkins-bot: [V: 04-1] snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 (owner: 10ArielGlenn) [15:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160331T1500). [15:00:04] dcausse: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:18] (03PS1) 10Dzahn: install_server: make multatuli the new bast, not hooft [puppet] - 10https://gerrit.wikimedia.org/r/280674 (https://phabricator.wikimedia.org/T123712) [15:00:23] o/ [15:00:47] I can SWAT today. 
[15:03:39] (03PS7) 10Ottomata: No-op refactor of eventlogging module [puppet] - 10https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) [15:03:58] (03PS2) 10Dzahn: install_server: make multatuli the new bast, not hooft [puppet] - 10https://gerrit.wikimedia.org/r/280674 (https://phabricator.wikimedia.org/T123712) [15:04:06] (03PS3) 10Elukey: Add basic varnishkafka rsyslog config to the varnishkafka module. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/279308 (https://phabricator.wikimedia.org/T129344) [15:04:16] (03CR) 10Dzahn: [C: 032] install_server: make multatuli the new bast, not hooft [puppet] - 10https://gerrit.wikimedia.org/r/280674 (https://phabricator.wikimedia.org/T123712) (owner: 10Dzahn) [15:06:20] 6Operations, 10vm-requests: eqiad: VM request for archiva - https://phabricator.wikimedia.org/T131358#2164873 (10MoritzMuehlenhoff) [15:06:52] (03PS4) 10Elukey: Add basic varnishkafka rsyslog config to the varnishkafka module. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/279308 (https://phabricator.wikimedia.org/T129344) [15:08:34] (03CR) 10Ottomata: [C: 031] Add basic varnishkafka rsyslog config to the varnishkafka module. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/279308 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [15:08:50] (03PS8) 10Ottomata: No-op refactor of eventlogging module [puppet] - 10https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) [15:08:59] (03CR) 10Ottomata: [C: 032 V: 032] No-op refactor of eventlogging module [puppet] - 10https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) (owner: 10Ottomata) [15:12:43] (03CR) 10Ema: [C: 031] Add basic varnishkafka rsyslog config to the varnishkafka module. 
[puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/279308 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [15:14:23] (03PS5) 10Muehlenhoff: Move dynamicproxy ferm rules into role::labs::novaproxy and role::labs::tools::proxy [puppet] - 10https://gerrit.wikimedia.org/r/274962 [15:14:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move dynamicproxy ferm rules into role::labs::novaproxy and role::labs::tools::proxy [puppet] - 10https://gerrit.wikimedia.org/r/274962 (owner: 10Muehlenhoff) [15:16:24] (03PS1) 10CSteipp: Enable Ex:OATHAuth in beta, disabled for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280676 [15:17:08] !log thcipriani@tin Synchronized php-1.27.0-wmf.19/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: SWAT: Ignore ResultSets that do not return pages [[gerrit:280669]] (duration: 00m 38s) [15:17:10] ^ dcausse check please [15:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:24] thcipriani: looks good, thanks! [15:19:38] dcausse: thanks for checking :) [15:20:42] (03CR) 10Elukey: [C: 032] Add basic varnishkafka rsyslog config to the varnishkafka module. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/279308 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [15:23:13] (03PS1) 10Elukey: Update the varnishkafka module with latest changes. [puppet] - 10https://gerrit.wikimedia.org/r/280678 (https://phabricator.wikimedia.org/T129344) [15:27:33] (03PS2) 10Elukey: Update the varnishkafka module with latest changes. 
[puppet] - 10https://gerrit.wikimedia.org/r/280678 (https://phabricator.wikimedia.org/T129344) [15:35:31] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [15:38:38] (03CR) 10JanZerebecki: [C: 031] added SPF record to phabricator.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [15:39:14] 6Operations, 10Analytics-Cluster, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2164979 (10Ottomata) @Robh, who should we ask? @faidon? [15:45:46] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2165016 (10mark) a:5mark>3RobH >>! In T128808#2119727, @RobH wrote: > Those are the only in warranty spare in eqiad with 32GB or greater memory. > > I'm escalating this request to @mar... [15:46:02] (03CR) 10Elukey: [C: 04-1] "https://puppet-compiler.wmflabs.org/2245/" [puppet] - 10https://gerrit.wikimedia.org/r/280678 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [15:47:02] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2165022 (10mark) a:5mark>3RobH Approved from the pool of new spare systems. [15:47:35] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2165027 (10mark) a:5mark>3RobH Approved. [15:47:49] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2165029 (10Ottomata) @Robh, are the SSDs for this already ordered too? 
[15:48:20] RECOVERY - puppet last run on cp1008 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:48:27] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2165030 (10Ottomata) @Robh, SSDs too ja? [15:49:07] (03CR) 10JanZerebecki: [C: 04-1] update the DNS record for benefactors.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: 10Mschon) [15:53:51] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail [15:54:26] (03PS3) 10Elukey: Update the varnishkafka module with latest changes. [puppet] - 10https://gerrit.wikimedia.org/r/280678 (https://phabricator.wikimedia.org/T129344) [15:55:24] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2165055 (10RobH) Negative, these systems don't have SSDs. There were spare pool systems ordered with SATA. So we can order new systems with SSDs, or swap the in warranty sata disks out for SSDs. [15:56:29] yo, anyone else having problems searching on www.wikitionary.org? [15:56:36] !log running checkLocalUser.php --delete on some wikis for T119736 [15:56:36] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [15:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:56:44] 503s and such [15:58:40] dr0ptp4kt: i just ran a few quick searches and didn't run into any problems [15:59:25] mdholloway: from the homepage? [15:59:35] mdholloway: like type and hit enter? [15:59:47] both www.wiktionary.org homepage and en.wiktionary [16:00:04] _joe_ gehel: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160331T1600). 
[16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:36] thcipriani: so this time for real... [16:00:37] <_joe_> I didn't sign up for puppetswat this week... [16:00:49] gehel: :) [16:00:51] <_joe_> also I'm still on a meeting [16:00:58] mdholloway: hmmm. i'm doing the following: 1. with clean browser side cache/storage, etc. go to https://www.wiktionary.org/, 2. type econometrist, 3. hit ENTER [16:01:09] dr0ptp4kt: completion suggestions seem a little over-eager but it doesn't sound like that's what you're running into [16:01:12] _joe_: someone from ops needs to edit the [[deployments]] page after ya'll decide in your meeting [16:01:38] mdholloway: correct. suggest is producing a nice list. but the ENTER is resulting in a 503 [16:01:44] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086645 (10RobH) [16:01:50] mdholloway: able to repro following those exact steps? [16:01:52] _joe_: I'll take it [16:02:07] dr0ptp4kt: oh, weird, i got the 503 searching 'econometrist' even without clearing my cache [16:02:21] mdholloway: how about other words? [16:02:41] mdholloway: i just typed spearphishing and that was fine [16:02:57] mdholloway: i know what it is, but it was the first word that came to mind [16:03:23] dr0ptp4kt: that's the first 503 i've hit so far, will try others [16:04:25] thcipriani: this look like a trivial change, but probably has a non trivial impact (and I have just no idea what it actually does). Can you give me a short walk through so that I learn something today? 
[16:05:15] _joe_: if you can comment when free on https://gerrit.wikimedia.org/r/#/c/277463/, that will be nice :) [16:05:35] <_joe_> kart_: pinging me twice a day won't make all the other stuff I'm doing go away [16:05:51] _joe_: 'when free' [16:05:54] <_joe_> I'm also just back from vacation, so the backlog is large [16:06:20] _joe_: no worries, last priority. [16:06:27] <_joe_> yeah, you pinged me this morning, I will look (but I removed my -2 before vacations, IIRC) [16:06:35] okay! [16:06:49] gehel: sure, so the network::constants are used throughout puppet to get access to ips that represent groups of servers in a specific realm and datacenter. In this case, we added a deployment_server in the labs realm and didn't add it to the network::constants in that patch. [16:07:06] thcipriani: it seems to me that this impacts only ferm rules and rsync modules, correct? [16:07:33] gehel: indeed. and only for the labs realm. [16:08:17] this patch is cherry-picked and working on the deployment-prep puppetmaster, this should be a noop for production and really just gets rid of a cherry pick in beta. [16:08:32] ok, let me rebase and merge this... [16:08:59] (03PS2) 10Gehel: Beta: Add mira to deployment_hosts [puppet] - 10https://gerrit.wikimedia.org/r/279392 (owner: 10Thcipriani) [16:10:09] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2165120 (10jcrespo) [16:10:11] 6Operations, 10DBA, 13Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#2165118 (10jcrespo) 5Open>3Resolved I am closing this task becau... 
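[editor's note] The network::constants mechanism thcipriani describes above can be pictured roughly like this. This is an illustrative Puppet sketch only: the hash layout and every address shown are assumptions for illustration, not the actual contents of modules/network/manifests/constants.pp.

```puppet
# Hypothetical sketch of realm-keyed constants, as described in the chat.
# Ferm rules and rsync module definitions elsewhere in the tree reference
# these addresses instead of hardcoding them per host.
class network::constants {
    $special_hosts = {
        'labs' => {
            # the kind of addition discussed: register the new beta
            # deployment server alongside the existing one
            'deployment_hosts' => [
                '10.68.16.58',   # illustrative address only
                '10.68.21.205',  # illustrative address for the added host
            ],
        },
    }
}
```

A consumer such as a ferm rule then interpolates the realm-appropriate `deployment_hosts` list into its source range, which is why leaving a new host out of the constants silently excludes it from firewall and rsync access, exactly the no-op-for-production / fix-for-beta situation described.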
[16:10:30] PROBLEM - HHVM rendering on mw1250 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [16:10:33] (03CR) 10Gehel: [C: 032] Beta: Add mira to deployment_hosts [puppet] - 10https://gerrit.wikimedia.org/r/279392 (owner: 10Thcipriani) [16:12:12] RECOVERY - HHVM rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 66517 bytes in 0.074 second response time [16:12:28] (03CR) 10Andrew Bogott: [C: 031] "Bryan, have you confirmed that these packages (and 'mytop' in the subsequent patch) are present and identically-named in precise, trusty, " [puppet] - 10https://gerrit.wikimedia.org/r/272415 (https://phabricator.wikimedia.org/T114388) (owner: 10BryanDavis) [16:12:40] thcipriani: merged... [16:12:56] gehel: okie doke, manually updating deployment-puppetmaster [16:13:18] thcipriani: ping me if you need anything else... [16:13:26] gehel: will do, thanks for your help! [16:13:43] thcipriani: at your service [16:13:54] * gehel is happy that puppet SWAT is going better today... 
[16:14:16] :D [16:15:39] (03PS3) 10Rush: Tools: Add dev packages needed to compile python-ldap [puppet] - 10https://gerrit.wikimedia.org/r/272415 (https://phabricator.wikimedia.org/T114388) (owner: 10BryanDavis) [16:16:05] (03CR) 10Rush: [C: 031] "yeah to confirm this lands on bastions and we have all flavors :)" [puppet] - 10https://gerrit.wikimedia.org/r/272415 (https://phabricator.wikimedia.org/T114388) (owner: 10BryanDavis) [16:21:21] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:27:10] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165194 (10GWicke) [16:27:32] PROBLEM - Host mr1-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.199) [16:28:12] PROBLEM - Host ps1-b2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:12] PROBLEM - Host ps1-c2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:12] PROBLEM - Host ps1-a5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:12] PROBLEM - Host ps1-c5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:12] PROBLEM - Host ps1-c6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:12] PROBLEM - Host ps1-d8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:12] PROBLEM - Host ps1-b6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:13] PROBLEM - Host ps1-d3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:14] PROBLEM - Host ps1-a1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:14] PROBLEM - Host ps1-c3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:21] PROBLEM - Host ps1-c8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:21] PROBLEM - Host ps1-b5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:21] PROBLEM - Host ps1-c1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:22] PROBLEM - Host ps1-b4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:22] 
PROBLEM - Host ps1-b7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:22] PROBLEM - Host ps1-d4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:22] PROBLEM - Host ps1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:22] PROBLEM - Host ps1-d6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:22] PROBLEM - Host ps1-d5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:23] PROBLEM - Host ps1-b3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:23] PROBLEM - Host ps1-a4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:41] PROBLEM - Host ps1-a7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:41] PROBLEM - Host ps1-b8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:41] PROBLEM - Host ps1-b1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:41] PROBLEM - Host ps1-a3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:41] PROBLEM - Host ps1-d2-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:41] PROBLEM - Host ps1-d7-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:28:50] PROBLEM - Host ps1-c4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [16:29:31] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:29:40] ^ anyone knows what is happening? I did just deploy a change to ferm rules (which was supposed to be a noop, tested with puppet agent -t --noop ona few hosts) [16:31:04] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165196 (10GWicke) Other ideas: - bootstraps: SAL says that 2004 was started 24 hours later, so unlikely. - repairs [16:32:24] as I understand it those alerts are misc equipment, most probably unrelated to my merge, still... [16:32:27] those are powersupplies I think? cmjohnson1 or robh any ideas on above? [16:32:46] yeah [16:32:58] chasemp: powersupplies and management router ... 
[16:33:04] mr1 is down [16:33:56] (03CR) 10Jgreen: [C: 04-1] "Other than the missing space the change looks ok to me, please resubmit." [dns] - 10https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: 10Mschon) [16:36:11] RECOVERY - Host ps1-b1-eqiad is UP: PING OK - Packet loss = 16%, RTA = 4.32 ms [16:36:11] RECOVERY - Host ps1-d7-eqiad is UP: PING OK - Packet loss = 16%, RTA = 3.55 ms [16:36:11] RECOVERY - Host ps1-a5-eqiad is UP: PING OK - Packet loss = 16%, RTA = 3.64 ms [16:36:11] RECOVERY - Host ps1-c6-eqiad is UP: PING OK - Packet loss = 16%, RTA = 3.85 ms [16:36:11] RECOVERY - Host ps1-c1-eqiad is UP: PING OK - Packet loss = 16%, RTA = 2.60 ms [16:36:11] RECOVERY - Host ps1-a2-eqiad is UP: PING OK - Packet loss = 16%, RTA = 3.07 ms [16:36:11] RECOVERY - Host ps1-d3-eqiad is UP: PING OK - Packet loss = 16%, RTA = 3.64 ms [16:36:12] RECOVERY - Host ps1-d2-eqiad is UP: PING OK - Packet loss = 16%, RTA = 2.43 ms [16:36:12] RECOVERY - Host ps1-b5-eqiad is UP: PING OK - Packet loss = 16%, RTA = 4.15 ms [16:36:13] RECOVERY - Host ps1-b6-eqiad is UP: PING OK - Packet loss = 16%, RTA = 3.86 ms [16:36:13] RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 16%, RTA = 2.58 ms [16:36:14] RECOVERY - Host ps1-b2-eqiad is UP: PING OK - Packet loss = 16%, RTA = 2.87 ms [16:36:31] RECOVERY - Host ps1-b3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.32 ms [16:36:31] RECOVERY - Host ps1-d6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [16:36:31] RECOVERY - Host ps1-b7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.55 ms [16:36:31] RECOVERY - Host ps1-c7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.77 ms [16:36:31] RECOVERY - Host ps1-d8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.68 ms [16:36:41] RECOVERY - Host ps1-c3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.80 ms [16:36:41] RECOVERY - Host ps1-c5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.31 ms [16:36:41] RECOVERY - Host ps1-a4-eqiad is UP: 
PING OK - Packet loss = 0%, RTA = 3.84 ms [16:36:41] RECOVERY - Host ps1-c8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.23 ms [16:36:41] RECOVERY - Host ps1-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.15 ms [16:36:41] RECOVERY - Host ps1-d4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms [16:37:30] RECOVERY - Host mr1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [16:41:03] (03PS4) 10Volans: [WIP] DB: Expose Puppet SSL certs and generate CA cert [puppet] - 10https://gerrit.wikimedia.org/r/279596 (https://phabricator.wikimedia.org/T111654) [16:41:40] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 9.41 ms [16:48:11] ostriches: ping, 911 [16:51:37] Hm? [16:51:52] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165344 (10GWicke) Top user agents in error messages over the last 10 days: Term Count Action Magnus labs tools 28227 MediaWiki/1.27.0-wmf.18 (RestbaseUpdateJob) 21209... 
[16:53:25] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165347 (10GWicke) [16:55:33] (03PS2) 10ArielGlenn: snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 [16:55:50] ostriches: solved by Merlinj [16:55:55] thanks anyway [16:56:21] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: puppet fail [16:56:48] (03CR) 10jenkins-bot: [V: 04-1] snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 (owner: 10ArielGlenn) [16:56:56] 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2165350 (10RobH) 5Open>3Resolved a:3RobH [16:57:49] (03CR) 10Volans: "Run of puppet compiler for the coredb:: affected hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/279596 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [16:59:21] 6Operations, 10ops-eqiad, 10netops: investigate why mr1-eqiad randomly rebooted - https://phabricator.wikimedia.org/T131379#2165365 (10Cmjohnson) [17:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160331T1700). 
[17:01:02] nothing today for parsoid [17:01:26] (03PS2) 10Filippo Giunchedi: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) [17:06:16] (03PS2) 10Dzahn: Enable base::firewall on rcs1002 [puppet] - 10https://gerrit.wikimedia.org/r/279339 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [17:09:17] (03PS3) 10ArielGlenn: snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 [17:12:44] (03PS1) 10Elukey: Add rsyslog configuration only if Service['rsyslog'] has been defined. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280690 (https://phabricator.wikimedia.org/T129344) [17:20:49] (03CR) 10Dzahn: "nginx is running on 0.0.0.0 80/443 and there is f.e. http://stream.wikimedia.org/rcstream_status but the rule would limit that to $INTERNA" [puppet] - 10https://gerrit.wikimedia.org/r/279339 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [17:21:31] 7Puppet, 10Continuous-Integration-Infrastructure: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699#2021271 (10dduvall) [17:21:55] (03CR) 10Dzahn: "ah, of course it is, because that's behind stream-lb.eqiad , nevermind" [puppet] - 10https://gerrit.wikimedia.org/r/279339 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [17:22:31] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165459 (10GWicke) Interestingly, [read request completed](https://grafana-admin.wikimedia.org/dashboard/db/cassandra-restbase-eqiad?panelId=24&fullscreen&from=1459187749... [17:23:30] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:23:38] (03CR) 10Dzahn: [C: 032] "confirmed with netstat. looks all covered. 
other ports are just the standards like diamond, nrpe etc that are covered by base" [puppet] - 10https://gerrit.wikimedia.org/r/279339 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [17:25:44] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165465 (10GWicke) Actually, https://grafana-admin.wikimedia.org/dashboard/db/restbase-cassandra-client-requests?panelId=32&fullscreen&from=1459248919680&to=1459313279864... [17:26:18] (03CR) 10Nemo bis: added SPF record to phabricator.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: 10Mschon) [17:28:28] 6Operations, 10Wikimedia-Stream, 13Patch-For-Review: Ferm rules for rcstream - https://phabricator.wikimedia.org/T104981#1433655 (10Dzahn) confirmed with netstat on rcs1002 that all listening ports are covered like Moritz said on the patch., merged. rules have been activated on rcs1002. rcs1001 is unchanged... [17:31:27] ori: hmm.. could you maybe confirm if rcstream looks normal ? [17:38:36] (03PS1) 10Dzahn: Revert "Enable base::firewall on rcs1002" [puppet] - 10https://gerrit.wikimedia.org/r/280694 [17:39:25] (03CR) 10Dzahn: [C: 032] "can't access the service status page anymore, but this should not happen" [puppet] - 10https://gerrit.wikimedia.org/r/280694 (owner: 10Dzahn) [17:42:22] Hey there, does anybody have a nice varnish config for Mediawiki they've made? I've been pulling my hair out at trying to figure out what Wikimedia/wikipedia and how they are able to properly purge their caches. I've been lead in some positive paths, such as https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging but nothing is translating to an actual basic config. 
[17:42:50] Mostly what I'm finding is methodologies for scaling, but really all I need to do is make a single varnish server cache the same content as Wikimedia would, along with purging the proper pages. [17:43:16] I have read the documentation and created varnish configs before, and have already tried decrypting what the Wikimedia team has in their puppet deployment, but it's all pretty specific to their service and leaves out some key structures [17:44:41] (03CR) 10Ottomata: [C: 032 V: 032] Add rsyslog configuration only if Service['rsyslog'] has been defined. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280690 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [17:46:32] !log rcs1002 - stop ferm [17:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:46:48] (03PS4) 10Elukey: Update the varnishkafka module with latest changes. [puppet] - 10https://gerrit.wikimedia.org/r/280678 (https://phabricator.wikimedia.org/T129344) [17:46:49] moritzm: still around? [17:46:52] got a python deb q [17:47:48] (03CR) 10Dzahn: "after this and stopping ferm, which flushes all rules, http://stream.wikimedia.org/rcstream_status works again right away. this is strange" [puppet] - 10https://gerrit.wikimedia.org/r/280694 (owner: 10Dzahn) [17:48:18] ah, actually, moritzm, you probably aren't around, but i have to afk for a bit, will be back in like 45 mins [17:50:36] 6Operations, 10Wikimedia-Stream, 13Patch-For-Review: Ferm rules for rcstream - https://phabricator.wikimedia.org/T104981#2165538 (10Dzahn) even though i saw no DROPs, this stopped working: http://stream.wikimedia.org/rcstream_status after that merge, which is really not expected because rcs1001/1002 are be...
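[editor's note] For anyone later reading this exchange: the Multicast_HTCP_purging page linked above boils down to MediaWiki emitting one small UDP datagram per purged URL, which the cache hosts listen for and turn into local purges. Below is a minimal Python sketch of such a CLR datagram, loosely following RFC 2756; the exact field layout and the multicast group in `send_purge` are best-effort assumptions for illustration, not the authoritative Wikimedia implementation.

```python
import socket
import struct

HTCP_OP_CLR = 4  # HTCP CLR opcode: ask caches to invalidate a URL

def countstr(s):
    """HTCP COUNTSTR: 16-bit big-endian length followed by the bytes."""
    b = s.encode('utf-8')
    return struct.pack('!H', len(b)) + b

def build_htcp_clr(url, trans_id=0):
    """Build one HTCP CLR datagram for `url` (a best-effort reading of
    RFC 2756, not byte-for-byte identical to MediaWiki's packets)."""
    # Specifier: method, URI, protocol version, empty request-headers block
    specifier = (countstr('HEAD') + countstr(url)
                 + countstr('HTTP/1.0') + struct.pack('!H', 0))
    data_len = 8 + len(specifier)   # 8-byte DATA header + specifier
    total_len = 4 + data_len + 2    # HTCP header + DATA + empty AUTH
    header = struct.pack('!HBB', total_len, 0, 0)    # length, version major/minor
    data_hdr = struct.pack('!HBBI', data_len, HTCP_OP_CLR, 0, trans_id)
    auth = struct.pack('!H', 2)     # AUTH block carrying only its own length
    return header + data_hdr + specifier + auth

def send_purge(url, group='239.128.0.112', port=4827):
    """Fire-and-forget the datagram at a multicast group (address and
    port here are illustrative; use whatever your caches listen on)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(build_htcp_clr(url), (group, port))
```

On the cache side a small daemon joins the multicast group, parses each datagram, and issues a local PURGE to Varnish; pointing MediaWiki's HTCP routing configuration at the same group and port closes the loop, which is the part the stock scaling-oriented docs tend to gloss over for a single-server setup.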
[18:06:50] (03PS8) 10Florianschmidtwelzow: Remove $wgCopyrightIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) [18:09:13] 6Operations, 10ops-eqiad: replace h310 with h710 controller in notebook1001 & notebook1002 poweredge r720xd systems - https://phabricator.wikimedia.org/T131331#2165616 (10RobH) a:3Cmjohnson [18:15:56] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165623 (10GWicke) I have updated [restbase-cassandra-client-requests](https://grafana-admin.wikimedia.org/dashboard/db/restbase-cassandra-client-requests?from=1459248919... [18:16:09] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165624 (10GWicke) 5Open>3Invalid [18:28:19] (03PS5) 10Yuvipanda: ores: Remove use of git clone [puppet] - 10https://gerrit.wikimedia.org/r/280247 (owner: 10Ladsgroup) [18:28:33] (03PS6) 10Yuvipanda: ores: Remove use of git clone [puppet] - 10https://gerrit.wikimedia.org/r/280247 (owner: 10Ladsgroup) [18:28:44] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Remove use of git clone [puppet] - 10https://gerrit.wikimedia.org/r/280247 (owner: 10Ladsgroup) [18:30:26] YuviPanda: maybe you know some of this, since you have done some python deb packaging [18:30:31] i'm using pydist-overrides [18:30:34] to fix a package dep [18:30:41] but i'm not sure how to make it fixed for python3 as well [18:30:56] my debian builds both python 2 and 3 packages using dh_python [18:31:20] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=778633 seems to indicate that i can use py3dist-overrides [18:31:28] but, that might be in a newer version of dh_python than we have on carbon [18:33:35] if you have never used pydist-overrides, ignore, i'll figure it out eventually...
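[editor's note] For the record, the pydist-overrides mechanism being discussed is just a small whitespace-separated mapping file under debian/. A hedged illustration using the tornado package mentioned earlier in the log (the version bound and exact package names are invented for the example):

```text
# debian/pydist-overrides -- maps a Python distribution name to the
# Debian dependency dh_python2 substitutes into ${python:Depends}
tornado python-tornado (>= 4.2)

# debian/py3dist-overrides -- the Python 3 counterpart, honoured by
# newer dh_python per Debian bug #778633
tornado python3-tornado (>= 4.2)
```

Each file lives next to debian/control; with both present, the generated python- and python3- binary packages pick up the corrected dependency, which is the python-2-vs-3 gap ottomata was trying to close.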
[18:52:03] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 3 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165704 (10mmodell) [18:53:37] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 3 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165728 (10mmodell) So I'm guessing that iridium -> gallium:4730 is probably fixable wit... [18:53:52] AH HA I GOT IT! (not that anyone is listening!) [18:56:13] 6Operations, 10hardware-requests, 13Patch-For-Review: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2165758 (10Cmjohnson) [18:56:16] 6Operations, 10ops-eqiad: replace h310 with h710 controller in notebook1001 & notebook1002 poweredge r720xd systems - https://phabricator.wikimedia.org/T131331#2165756 (10Cmjohnson) 5Open>3Resolved Raid controller replaced both setup w/ Raid 10 and 256k stripe. [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160331T1900). [19:00:21] train time [19:00:29] choo choo [19:02:09] huh, no bot. allwikis to wmf.19 patch: https://gerrit.wikimedia.org/r/#/c/280703/ [19:02:32] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 3 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2165259 (10Dzahn) >>! In T131375#2165682, @mmodell wrote: > @chasemp: connecting to gall... [19:02:54] 6Operations, 10ops-eqiad: replace h310 with h710 controller in notebook1001 & notebook1002 poweredge r720xd systems - https://phabricator.wikimedia.org/T131331#2165774 (10Ottomata) FYI, these boxes are OOW. 
Analytics is no longer using them because we’ve been told to replace OOW nodes. Not sure if using up a... [19:03:44] thcipriani: huh, right now I'm even forgetting the bot's name [19:04:05] our logging infra depending on IRC is suboptimal :/ [19:04:11] logmsgbot [19:04:16] Dear passengers, MediaWiki train wmf.19 with destination all wikis is at platform 1 and will depart in 5 minutes [19:04:19] (that part of the logging, that is, not all logging, obviously, for those following along) [19:04:29] logmsgbot: ping [19:04:56] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.19 [19:05:00] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2165778 (10Southparkfan) @Cmjohnson shouldn't T126350 be closed as a duplicate of this one instead of vice versa? [19:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:22] logmsgbot: welcome back [19:06:16] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2165786 (10Cmjohnson) [19:06:18] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: mw1026-69 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T129060#2165785 (10Cmjohnson) 5duplicate>3Open [19:08:17] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: Decommission mw1037 - https://phabricator.wikimedia.org/T126350#2165797 (10Cmjohnson) 5Open>3Resolved resolving this part of T129060. [19:08:52] @seen grrrit-wm [19:08:52] mutante: Last time I saw grrrit-wm they were quitting the network with reason: Remote host closed the connection N/A at 3/31/2016 6:31:19 PM (37m33s ago) [19:09:03] SouthparkfanZNC: thx...you are right merged the wrong way. 
[19:10:04] /usr/lib/ruby/vendor_ruby/puppet-lint/bin.rb:78:in `block in run': invalid option: --no-puppet_url_without_modules-check (OptionParser::InvalidOption) [19:10:17] mutante: am pushing it [19:10:35] YuviPanda: thanks [19:10:50] that error is from jenkins and unrelated [19:11:42] 6Operations, 10Analytics-Cluster, 10hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2165810 (10Cmjohnson) [19:11:44] 6Operations, 10Analytics-Cluster, 10hardware-requests: update label on analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130845#2165808 (10Cmjohnson) 5Open>3Resolved updated label and racktables [19:12:54] !log Starting slave for s2 on db1047 [19:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:15] you know that you are officially doing a better job than I do, volans, right? [19:14:44] lol [19:14:48] ok, so just "recheck" and jenkins-bot changes its mind [19:14:56] and that error is gone [19:19:47] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[19:20:37] (03Abandoned) 10Yuvipanda: labs: return CNAMEs only when asked for [puppet] - 10https://gerrit.wikimedia.org/r/278941 (owner: 10Yuvipanda) [19:21:36] (03PS5) 10ArielGlenn: snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 [19:24:55] (03PS3) 10Dzahn: contint:firewall: let phabricator talk to gearman [puppet] - 10https://gerrit.wikimedia.org/r/280706 (https://phabricator.wikimedia.org/T131375) [19:27:33] (03PS1) 10Ladsgroup: Add service-deploy beta public key for keyholder [puppet] - 10https://gerrit.wikimedia.org/r/280708 [19:34:47] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:34:55] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:05] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:16] blarg [19:35:20] is the pfw falling over again? [19:35:27] Jeff_Green: ^ sadness [19:35:33] <_joe_> what should we do besides paging jeff? [19:35:36] woo there goes the other one [19:35:42] (03CR) 1020after4: [C: 04-1] "I'm trying to get this moved to hiera + secret() for the private keys," [puppet] - 10https://gerrit.wikimedia.org/r/280708 (owner: 10Ladsgroup) [19:35:57] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100% [19:35:58] codfw isnt live still right? so its not user affecting but isnt good. [19:36:05] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100% [19:36:11] <_joe_> it is live [19:36:17] <_joe_> can you phone jeff? [19:36:21] frack codfw is the primary codfw? he just replied in here [19:36:25] we dont need to phone him. [19:36:36] sorry, frack codfw is the primary frack? is what i meant to ask. 
[19:36:45] <_joe_> I missed his message [19:36:47] frack codfw is the failover site, not active [19:36:51] <_joe_> robh: I'm unsure [19:36:54] cool [19:36:55] page jeff but nothing is active there atm [19:37:02] Jeff_Green: I assumed you would send a big email when that changed =] [19:37:03] hi Jeff_Green :) [19:37:03] <_joe_> Jeff_Green: ok so it's not active-active? [19:37:27] <_joe_> I thought you did send traffic to codfw, wise choice not to [19:37:30] hey [19:37:36] <_joe_> given what's happening [19:37:54] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100% [19:38:02] PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100% [19:38:09] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100% [19:38:46] <_joe_> paravoid: srx troubles in dallas again I guess, but I have no access to any of those clusters [19:38:52] why is icinga double-paging? [19:39:07] I got one per host [19:39:25] I got two per host by email, within ~5s [19:39:54] oh i see, one to me directly and the other to alerts@ [19:39:55] by email yes, 2 [19:40:16] (03CR) 10Ottomata: [C: 031] Update the varnishkafka module with latest changes.
[puppet] - 10https://gerrit.wikimedia.org/r/280678 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [19:40:17] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.60 ms [19:40:25] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 37.15 ms [19:40:32] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms [19:40:39] -rw-rw---- 1 root wheel 0 Mar 31 19:32 /var/tmp/flowd_octeon_hm.core.0.gz [19:40:40] RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 36.41 ms [19:40:42] goddammit [19:40:47] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 36.65 ms [19:40:57] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 36.49 ms [19:41:04] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms [19:41:12] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.36 ms [19:41:13] paravoid: not the core you're looking for? [19:41:25] it's 0-byte [19:41:31] <_joe_> sigh [19:41:33] oh ha, i missed that [19:41:41] maybe there's a "make core not useless" setting? [19:41:55] I ran a storage cleanup, maybe there isn't enough space for it [19:42:00] although df showed quite a bit [19:42:02] but who knows... [19:42:36] ok [19:42:40] everything seems stable again [19:42:45] going afk for now... 
:/ [19:42:52] (03PS6) 10Ottomata: Run eventlogging services out of deployed eventlogging source path [puppet] - 10https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) [19:43:03] so everybody remember...now that codfw has faceplanted 2x in 24hours, we can expect eqiad to do the same within a few days [19:43:28] not necessarily [19:43:45] hasn't always happened [19:44:00] true, just don't be surprised when it does :-) [19:44:19] (03CR) 1020after4: [C: 031] contint:firewall: let phabricator talk to gearman [puppet] - 10https://gerrit.wikimedia.org/r/280706 (https://phabricator.wikimedia.org/T131375) (owner: 10Dzahn) [19:45:52] (03CR) 10Ottomata: [C: 032] Run eventlogging services out of deployed eventlogging source path [puppet] - 10https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) (owner: 10Ottomata) [19:46:36] (03PS1) 10GWicke: Start sampling regular & slow requests [puppet] - 10https://gerrit.wikimedia.org/r/280711 [19:46:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [19:47:46] (03CR) 10GWicke: "Note: This depends on https://github.com/wikimedia/hyperswitch/pull/31 and https://github.com/wikimedia/service-runner/pull/93 being deplo" [puppet] - 10https://gerrit.wikimedia.org/r/280711 (owner: 10GWicke) [19:48:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0] [19:50:10] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 5 failures [19:55:06] ^^^^ looking at fdb2001 [19:55:10] RECOVERY - check_puppetrun on fdb2001 is OK: OK: Puppet is currently enabled, last run 78 seconds ago with 0 failures [19:55:22] (03PS6) 10ArielGlenn: snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 [19:55:23] ^^^^ nevermind. 
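The "Kafka Broker Replica Max Lag" alerts above report what fraction of recent datapoints sit above a threshold (e.g. "65.52% of data above the critical threshold [5000000.0]", recovering once "Less than 50.00% above the threshold [1000000.0]"). A minimal sketch of that percentage-over-threshold logic (function names and the 50% trigger are assumptions inferred from the alert text):

```python
def percent_above(datapoints, threshold):
    """Percentage (0-100) of non-null datapoints strictly above threshold."""
    vals = [v for v in datapoints if v is not None]
    if not vals:
        return 0.0
    return 100.0 * sum(v > threshold for v in vals) / len(vals)

def lag_status(datapoints, crit=5_000_000, ok=1_000_000, trigger=50.0):
    # CRITICAL when more than `trigger` percent of points exceed `crit`;
    # OK once fewer than `trigger` percent exceed the lower `ok` threshold.
    if percent_above(datapoints, crit) > trigger:
        return "CRITICAL"
    if percent_above(datapoints, ok) < trigger:
        return "OK"
    return "WARNING"
```

Using separate CRITICAL and OK thresholds gives the check hysteresis, so a broker hovering near one boundary does not flap between states.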
[19:55:34] (03PS7) 10ArielGlenn: snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 [19:55:53] 6Operations, 10ops-codfw, 6DC-Ops: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2165900 (10Papaul) a:5Papaul>3Dzahn @Dzahn disk replacement complete. [19:56:39] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165904 (10Eevans) 5Invalid>3Open >>! In T131370#2165465, @GWicke wrote: > Actually, https://grafana-admin.wikimedia.org/dashboard/db/restbase-cassandra-client-reques... [19:57:59] 6Operations, 10ops-codfw, 6DC-Ops: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2165908 (10Papaul) [19:58:00] RECOVERY - Host bast2001 is UP: PING OK - Packet loss = 0%, RTA = 37.66 ms [19:59:16] !log temporarily stopped puppet on eventlog1001 [19:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:22] oh, coool. . bast2001 recovery [20:01:41] !log stopping eventlogging, uninstalling globally installed eventlogging python code, running puppet, restarting eventlogging from /srv/deployment/eventlogging/eventlogging [20:01:42] papaul: !thank you [20:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:48] oops, wrong chat, but oh well! [20:04:00] (03PS8) 10ArielGlenn: snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 [20:05:33] (03PS9) 10ArielGlenn: snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 [20:06:32] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165931 (10Eevans) >>! In T131370#2165904, @Eevans wrote: >>>! 
In T131370#2165465, @GWicke wrote: >> Actually, https://grafana-admin.wikimedia.org/dashboard/db/restbase-c... [20:07:14] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2165939 (10GWicke) > I'm not sure this is right. cassandra.$node.org.apache.cassandra.metrics.ClientRequest.Read.Latency is a dropwizard timer (see: https://dropwizard.g... [20:16:07] (03PS1) 10Andrew Bogott: Use 'admin' as the default auth project for 'novaadmin' [puppet] - 10https://gerrit.wikimedia.org/r/280718 (https://phabricator.wikimedia.org/T131395) [20:17:44] (03CR) 10ArielGlenn: [C: 032] snapshots: restructure directory handling a bit [puppet] - 10https://gerrit.wikimedia.org/r/280673 (owner: 10ArielGlenn) [20:18:45] (03PS2) 10Andrew Bogott: Use 'admin' as the default auth project for 'novaadmin' [puppet] - 10https://gerrit.wikimedia.org/r/280718 (https://phabricator.wikimedia.org/T131395) [20:19:52] (03CR) 10Andrew Bogott: [C: 032] Use 'admin' as the default auth project for 'novaadmin' [puppet] - 10https://gerrit.wikimedia.org/r/280718 (https://phabricator.wikimedia.org/T131395) (owner: 10Andrew Bogott) [20:21:20] !log bast2001 - reinstall after disk replacement [20:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:28] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2166104 (10Eevans) >>! In T131370#2165939, @GWicke wrote: >> I'm not sure this is right. cassandra.$node.org.apache.cassandra.metrics.ClientRequest.Read.Latency is a dro... [20:27:14] (03CR) 10Ottomata: [C: 031] "I also need this for https://phabricator.wikimedia.org/T118772" [puppet] - 10https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: 10Mobrovac) [20:28:38] (03CR) 10Brion VIBBER: [C: 031] "Looks good per docs. 
:)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279394 (https://phabricator.wikimedia.org/T128795) (owner: 10Fjalapeno) [20:31:31] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:38:55] !log bast2001 - revoking old, signing new puppet certs, salt key.. [20:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:53] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:43:21] (03PS1) 10Ottomata: Create eventlogging::deployment::target define that abstracts scap::target for eventlogging targets [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [20:44:17] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2166227 (10Eevans) This is interesting: {P2844} There seems to be a disproportionate amount of traffic directed at https://en.wikipedia.org/wiki/Communications_and_Info... [20:45:37] (03CR) 10jenkins-bot: [V: 04-1] Create eventlogging::deployment::target define that abstracts scap::target for eventlogging targets [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [20:47:12] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:47:51] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:48:21] PROBLEM - SSH on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:48:31] PROBLEM - HHVM processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:48:50] PROBLEM - RAID on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:48:51] PROBLEM - puppet last run on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:49:11] PROBLEM - configured eth on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:11] PROBLEM - nutcracker port on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:24] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2166241 (10GWicke) > We see a jump in restbase requests to, ~1.5k/s, starting at ~14:00 on the 29th, and a corresponding jump in client requests (~1.5k/s) on 1012-b at th... [20:49:31] PROBLEM - Check size of conntrack table on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:32] PROBLEM - nutcracker process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:49:51] PROBLEM - DPKG on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:11] RECOVERY - HHVM processes on mw1143 is OK: PROCS OK: 6 processes with command name hhvm [20:55:21] PROBLEM - Disk space on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:40] PROBLEM - HHVM processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:01] PROBLEM - dhclient process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:57:01] PROBLEM - salt-minion processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:59:27] (03PS1) 10ArielGlenn: snapshots: move dumps cron job script to same location as the rest [puppet] - 10https://gerrit.wikimedia.org/r/280754 [21:00:31] (03CR) 10jenkins-bot: [V: 04-1] snapshots: move dumps cron job script to same location as the rest [puppet] - 10https://gerrit.wikimedia.org/r/280754 (owner: 10ArielGlenn) [21:00:49] !log bast2001 has been reinstalled and can be used again. 
fingerprints at https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast2001.wikimedia.org [21:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:01:08] (03PS2) 10ArielGlenn: snapshots: move dumps cron job script to same location as the rest [puppet] - 10https://gerrit.wikimedia.org/r/280754 [21:02:26] (03CR) 10jenkins-bot: [V: 04-1] snapshots: move dumps cron job script to same location as the rest [puppet] - 10https://gerrit.wikimedia.org/r/280754 (owner: 10ArielGlenn) [21:03:10] this is the day of the typo [21:03:12] meh [21:04:21] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:06:13] (03PS3) 10ArielGlenn: snapshots: move dumps cron job script to same location as the rest [puppet] - 10https://gerrit.wikimedia.org/r/280754 [21:07:30] even typos trying to push the dang changeset [21:07:36] prolly means I need to pack it in soon [21:07:44] (03CR) 10jenkins-bot: [V: 04-1] snapshots: move dumps cron job script to same location as the rest [puppet] - 10https://gerrit.wikimedia.org/r/280754 (owner: 10ArielGlenn) [21:07:45] some days are cappy puppet days [21:07:49] *crappy [21:09:15] (03PS4) 10ArielGlenn: snapshots: move dumps cron job script to same location as the rest [puppet] - 10https://gerrit.wikimedia.org/r/280754 [21:10:06] (03PS1) 10Dzahn: base: add script to generate fingerprints [puppet] - 10https://gerrit.wikimedia.org/r/280757 [21:10:40] (03PS2) 10Dzahn: base: add script to generate fingerprints [puppet] - 10https://gerrit.wikimedia.org/r/280757 [21:14:20] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [21:14:21] RECOVERY - configured eth on mw1143 is OK: OK - interfaces up [21:14:31] RECOVERY - Check size of conntrack table on mw1143 is OK: OK: nf_conntrack is 0 % full [21:14:31] RECOVERY - nutcracker process on mw1143 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:14:50] RECOVERY 
- dhclient process on mw1143 is OK: PROCS OK: 0 processes with command name dhclient [21:14:50] RECOVERY - salt-minion processes on mw1143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:14:50] RECOVERY - DPKG on mw1143 is OK: All packages OK [21:15:00] RECOVERY - Disk space on mw1143 is OK: DISK OK [21:15:01] RECOVERY - SSH on mw1143 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [21:15:05] (03PS3) 10Dzahn: base: add script to generate fingerprints [puppet] - 10https://gerrit.wikimedia.org/r/280757 [21:15:10] RECOVERY - HHVM processes on mw1143 is OK: PROCS OK: 6 processes with command name hhvm [21:16:51] (03CR) 10ArielGlenn: [C: 032] snapshots: move dumps cron job script to same location as the rest [puppet] - 10https://gerrit.wikimedia.org/r/280754 (owner: 10ArielGlenn) [21:19:05] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2166336 (10GWicke) @eevans, the toppartition results are consistent with https://grafana.wikimedia.org/dashboard/db/cassandra-restbase-eqiad?panelId=24&fullscreen. To me... [21:19:40] PROBLEM - configured eth on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:19:40] PROBLEM - nutcracker port on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:19:52] PROBLEM - Check size of conntrack table on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:22] PROBLEM - Disk space on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:31] PROBLEM - SSH on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:20:40] PROBLEM - HHVM processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:21:12] (03CR) 10Awight: [C: 031] "Thanks, that will work for us!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [21:21:18] (03PS5) 10Awight: Use full URL in $wgNoticeHideUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [21:22:56] (03PS1) 10ArielGlenn: snapshots: update location of dumps cron script in jobs calling it [puppet] - 10https://gerrit.wikimedia.org/r/280761 [21:23:14] (03CR) 10Krinkle: [C: 04-1] "Please beware that https://en.wikipedia.org/w/index.php?title=Special:HideBanners is a redirect due because of normalisation being enforce" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [21:23:31] PROBLEM - nutcracker process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:23:37] 6Operations, 13Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2166348 (10Dzahn) [21:23:39] 6Operations: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#2166349 (10Dzahn) [21:23:41] 6Operations, 10ops-codfw, 6DC-Ops: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2166346 (10Dzahn) 5Open>3Resolved thank you @papaul ! [21:23:50] PROBLEM - salt-minion processes on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:23:50] PROBLEM - DPKG on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:23:50] PROBLEM - dhclient process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:24:11] (03CR) 10jenkins-bot: [V: 04-1] snapshots: update location of dumps cron script in jobs calling it [puppet] - 10https://gerrit.wikimedia.org/r/280761 (owner: 10ArielGlenn) [21:25:22] (03PS2) 10ArielGlenn: snapshots: update location of dumps cron script in jobs calling it [puppet] - 10https://gerrit.wikimedia.org/r/280761 [21:25:39] (03CR) 10Krinkle: "As mentioned in the commit message, to avoid a normalisation redirect, the url must have something to it that isn't semantically the same " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [21:26:17] 6Operations, 13Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2166356 (10Dzahn) reinstalled with jessie, re-signed puppet/salt. can be used again. fingerprints ``` RSA MD5:3f:18:b6:2d:12:1c:81:93:74:a2:eb:86:2c:7c:80:41 SHA256:saX7tsDLjsHCU67XroGcw+tAw... [21:26:20] !log disabling puppet on cerium, updating config and deploying restbase to staging. Testing https://gerrit.wikimedia.org/r/#/c/280711/ [21:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:28] (03CR) 10Krinkle: "Revoking -1 since the url is later extended in JavaScript and never used directly." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [21:26:31] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [21:26:31] RECOVERY - configured eth on mw1143 is OK: OK - interfaces up [21:26:42] 6Operations: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#2166358 (10Dzahn) [21:26:44] 6Operations, 13Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2166357 (10Dzahn) 5Open>3Resolved [21:26:50] RECOVERY - Check size of conntrack table on mw1143 is OK: OK: nf_conntrack is 0 % full [21:26:51] RECOVERY - nutcracker process on mw1143 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:26:56] 6Operations: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2089630 (10Dzahn) [21:27:10] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 7.723 second response time [21:27:11] RECOVERY - salt-minion processes on mw1143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:27:11] RECOVERY - dhclient process on mw1143 is OK: PROCS OK: 0 processes with command name dhclient [21:27:11] RECOVERY - DPKG on mw1143 is OK: All packages OK [21:27:22] RECOVERY - Disk space on mw1143 is OK: DISK OK [21:27:30] RECOVERY - SSH on mw1143 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [21:27:31] RECOVERY - HHVM processes on mw1143 is OK: PROCS OK: 6 processes with command name hhvm [21:27:48] (03CR) 10ArielGlenn: [C: 032] snapshots: update location of dumps cron script in jobs calling it [puppet] - 10https://gerrit.wikimedia.org/r/280761 (owner: 10ArielGlenn) [21:27:51] RECOVERY - RAID on mw1143 is OK: OK: no RAID installed [21:27:51] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [21:28:10] 
RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 66485 bytes in 0.418 second response time [21:29:44] 6Operations, 13Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2166360 (10Dzahn) [21:32:11] (03PS1) 10ArielGlenn: hold off full dumps cron starting til the 4th of this month [puppet] - 10https://gerrit.wikimedia.org/r/280762 [21:33:48] (03CR) 10ArielGlenn: [C: 032] hold off full dumps cron starting til the 4th of this month [puppet] - 10https://gerrit.wikimedia.org/r/280762 (owner: 10ArielGlenn) [21:37:51] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2166381 (10GWicke) The only related log message I have found in logstash is a timeout from a request from restbase to http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.... [22:02:08] !log reenable puppet on cerium in restbase staging [22:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:20] (03PS1) 10Yuvipanda: labs: Do Public DNS translation for telnet project too [puppet] - 10https://gerrit.wikimedia.org/r/280768 [22:03:50] (03PS2) 10Yuvipanda: labs: Do Public DNS translation for telnet project too [puppet] - 10https://gerrit.wikimedia.org/r/280768 [22:04:36] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Do Public DNS translation for telnet project too [puppet] - 10https://gerrit.wikimedia.org/r/280768 (owner: 10Yuvipanda) [22:05:13] (03PS1) 10ArielGlenn: snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 [22:05:35] (03Abandoned) 10Ladsgroup: Add service-deploy beta public key for keyholder [puppet] - 10https://gerrit.wikimedia.org/r/280708 (owner: 10Ladsgroup) [22:06:27] !log update restbase to ba39d2bc canary on restbase1005 [22:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, 
Master [22:06:32] (03CR) 10jenkins-bot: [V: 04-1] snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 (owner: 10ArielGlenn) [22:07:20] (03CR) 10GWicke: "This has been tested in staging. The corresponding deploy is going out right now." [puppet] - 10https://gerrit.wikimedia.org/r/280711 (owner: 10GWicke) [22:08:10] (03PS2) 10ArielGlenn: snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 [22:10:01] (03PS3) 10ArielGlenn: snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 [22:13:20] (03PS2) 10Ottomata: Create eventlogging::deployment::target define that abstracts scap::target for eventlogging targets [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [22:13:22] (03PS1) 10Ottomata: [WIP] Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280771 (https://phabricator.wikimedia.org/T118772) [22:14:58] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280771 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [22:15:53] !log started update restbase to ba39d2bc [22:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:17:43] (03PS2) 10Ottomata: [WIP] Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280771 (https://phabricator.wikimedia.org/T118772) [22:19:02] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280771 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [22:20:31] (03PS4) 10ArielGlenn: 
snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 [22:20:49] (03PS5) 10ArielGlenn: snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 [22:22:08] !log finished update restbase to ba39d2bc [22:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:59] (03PS2) 10MarcoAurelio: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280456 (https://phabricator.wikimedia.org/T130514) (owner: 10Thcipriani) [22:24:24] (03PS2) 10Yuvipanda: Start sampling regular & slow requests [puppet] - 10https://gerrit.wikimedia.org/r/280711 (owner: 10GWicke) [22:24:31] (03CR) 10Yuvipanda: [C: 032 V: 032] Start sampling regular & slow requests [puppet] - 10https://gerrit.wikimedia.org/r/280711 (owner: 10GWicke) [22:27:50] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:30:11] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [22:30:59] (03PS6) 10ArielGlenn: snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 [22:37:36] andrewbogott: ^ hmm, so that's failing because novaadmin isn't a member of the 'telnet' project [22:39:21] andrewbogott: I added it and it works ok [22:40:11] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [22:40:56] YuviPanda: is the 'telnet' project new? 
[22:43:39] andrewbogott, pretty sure it is, TimStarling's project [22:44:20] I just clicked the "create project" button in wikitech.wikimedia.org [22:44:31] (03PS7) 10ArielGlenn: snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 [22:44:39] so many channels [22:44:54] TimStarling: yeah, I think I must have broken that auto-add feature when I switched projects out of ldap [22:45:02] going for a record on how many times I can do this wrong [22:45:11] at almost 2 am the answer is, lots and lots [22:50:33] (03PS8) 10ArielGlenn: snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 [22:51:03] !log rolling restart restbase. Apply https://gerrit.wikimedia.org/r/#/c/280711/ config change [22:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:55:55] 7Blocked-on-Operations, 6Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2166588 (10greg) >>! In T112765#1822062, @chasemp wrote: > We need to make a plan to get connectivity through to the end host for this. Thi... [22:56:30] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [22:58:29] Aww, nothing to swat. [22:58:39] (03CR) 10ArielGlenn: [C: 032] snapshots: fix up all variable refs to dump dirs in templates [puppet] - 10https://gerrit.wikimedia.org/r/280769 (owner: 10ArielGlenn) [22:58:58] awight: We're adding one right now!!! :) [22:59:13] purrfect [23:00:04] RoanKattouw ostriches Krenair MaxSem awight: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160331T2300). [23:01:37] ahem [23:01:42] dapatrick, yt? [23:01:45] Yep. [23:01:46] I'm here. [23:02:37] MaxSem: I'm happy to take this one, up to you. 
[23:03:22] awight, thanks! I've sent all 3 today's commits to Zuul, you just need to push them when they get merged [23:03:37] e-z, thanks [23:06:15] (03PS1) 10ArielGlenn: snapshots: one more dblist dir fixup [puppet] - 10https://gerrit.wikimedia.org/r/280782 [23:07:55] (03CR) 10ArielGlenn: [C: 032] snapshots: one more dblist dir fixup [puppet] - 10https://gerrit.wikimedia.org/r/280782 (owner: 10ArielGlenn) [23:11:37] dapatrick: MaxSem: Ready to push your patches, whenever you are. [23:11:50] Okay, sure go ahead. [23:13:06] !log awight@tin Synchronized php-1.27.0-wmf.19/extensions/OATHAuth: SWAT deployment of OATHAuth fixes (duration: 00m 46s) [23:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:40] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [23:14:14] dapatrick: Sorry-false alarm. Deploying for real now. [23:14:22] Okay. [23:14:38] !log awight@tin Synchronized php-1.27.0-wmf.19/extensions/OATHAuth: SWAT deployment of OATHAuth fixes, take 2 (duration: 00m 32s) [23:14:40] Submodules get me every time... [23:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:59] dapatrick: Should be deployed now. Please lmk if you're able to test the feature. [23:17:36] awight: Thanks. We lost an i18n message somehow, but the core features work fine. [23:18:54] dapatrick: Cool--I assume it's best to leave this deployed and not rollback for the message? [23:19:04] Actually--is it a new message? [23:19:10] I should have run a full scap. [23:19:14] It's a new message, yes. [23:19:38] Okay. I'll start the scap, then! [23:19:53] Thanks! [23:20:25] * awight mumbles to self. I think we need a full scap whenever any messages change, so there are other i18n string updates which haven't been deployed either.
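awight's point above is that syncing only an extension directory skips the localisation-cache rebuild, so any change touching i18n message files needs a full scap rather than a sync. A toy sketch of that decision (the helper name and the i18n/ path convention are assumptions for illustration):

```python
def needs_full_scap(changed_paths):
    """True if any changed file looks like a localisation message file.

    MediaWiki extensions conventionally keep JSON message files under an
    i18n/ directory; syncing code alone does not rebuild the l10n cache,
    which is why the missing message above only appeared after a full scap.
    """
    return any("i18n" in path.split("/") for path in changed_paths)
```

A deploy helper could use this to warn a SWAT deployer before they reach for a directory sync.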
[23:21:16] !log awight@tin Started scap: (no message) [23:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:52] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2166709 (10GWicke) Okay, @pchelolo resolved the mystery by deploying the sampled logging logic. The source of those requests is the mobile content service, and there was... [23:25:22] (03PS4) 10Dzahn: contint:firewall: let phabricator talk to gearman [puppet] - 10https://gerrit.wikimedia.org/r/280706 (https://phabricator.wikimedia.org/T131375) [23:25:36] (03CR) 10Dzahn: [C: 032] contint:firewall: let phabricator talk to gearman [puppet] - 10https://gerrit.wikimedia.org/r/280706 (https://phabricator.wikimedia.org/T131375) (owner: 10Dzahn) [23:26:20] (03CR) 10Luke081515: [C: 04-1] "Before we dedeploy this, we should discuss that with the affected communitys imo, for details, look at the task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: 10Catrope) [23:30:01] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [23:30:44] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2166726 (10Dzahn) on gallium: Notice: /Stage[main]/Contint::Firewall/Ferm::Service[gear... 
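The contint:firewall change merged above opens the Gearman port to Phabricator via a ferm rule, as the Ferm::Service notice from gallium shows. A hypothetical sketch of what such a resource looks like in the operations/puppet style (the resource title and hostname are placeholders, not the production values; only Gearman's default port 4730 is a known constant):

```puppet
# Hypothetical sketch only: allow a Phabricator host to reach Gearman.
# The hostname below is a placeholder, not the real production host.
ferm::service { 'gearman_from_phabricator':
    proto  => 'tcp',
    port   => '4730',
    srange => '@resolve((phabricator.example.org))',
}
```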
[23:31:42] twentyafterfour: ^ the gearman part should be fixed now [23:45:19] (03PS1) 10Dzahn: Revert "install_server: make multatuli the new bast, not hooft" [puppet] - 10https://gerrit.wikimedia.org/r/280788 [23:45:29] (03PS2) 10Dzahn: Revert "install_server: make multatuli the new bast, not hooft" [puppet] - 10https://gerrit.wikimedia.org/r/280788 [23:46:16] (03CR) 10Dzahn: [C: 032] Revert "install_server: make multatuli the new bast, not hooft" [puppet] - 10https://gerrit.wikimedia.org/r/280788 (owner: 10Dzahn) [23:48:00] !log awight@tin Finished scap: (no message) (duration: 26m 44s) [23:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:25] dapatrick: Feel free to check the i18n messages when you get a moment. [23:48:43] awight: Checking now. [23:49:43] 6Operations, 10RESTBase-Cassandra, 6Services: Investigate high read requests on restbase1012-a - https://phabricator.wikimedia.org/T131370#2166776 (10Eevans) >>! In T131370#2166709, @GWicke wrote: > Okay, @pchelolo resolved the mystery by deploying the sampled logging logic. The source of those requests is t... [23:49:46] Yay! 2-step working on wikitech... [23:51:21] :) [23:51:49] csteipp: for all users now or a different kind? [23:51:51] csteipp: https://en.wikipedia.org/wiki/2-step_garage ? [23:52:41] mutante: Just you put your token in on a second screen, if you have OATH enabled. [23:53:06] If you don't, then you never see the token prompt [23:53:27] * YuviPanda now has an android watch that can show me my 2fa, which is the most use I've found for it now [23:54:24] confirmed working. i had enabled before and it works before and after, the difference is the separate screen [23:57:56] (03PS1) 10Dzahn: install_server: re-use amslvs1 for bast3001 [puppet] - 10https://gerrit.wikimedia.org/r/280791 (https://phabricator.wikimedia.org/T123712) [23:58:22] YuviPanda: monitoring alerts on android watch ? 
[23:58:48] mutante: I've actually blocked messaging from the watch, so no :D [23:59:01] just _to_ the watch :p [23:59:21] yea, i was wondering what it actually does [23:59:51] mutante: it listens to all notifications on your phone and ports them over
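The two-step login csteipp and mutante confirm above uses OATH TOTP codes, the same kind a phone or watch authenticator generates. A short self-contained sketch of TOTP per RFC 6238 (HOTP with SHA-1 over a 30-second time window); this is the generic algorithm, not the actual MediaWiki OATHAuth implementation:

```python
import hashlib
import hmac
import struct
import time

def hotp(key: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HMAC-based one-time password (SHA-1)."""
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                      # dynamic truncation offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(key: bytes, now=None, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 time-based OTP: HOTP keyed to the current 30-second window."""
    t = time.time() if now is None else now
    return hotp(key, int(t // step), digits)
```

The server and the device share only the secret key; both derive the same code independently from the current time, which is why the token screen works without any network round-trip to the authenticator.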