[00:00:49] RECOVERY - Disk space on elastic1002 is OK: DISK OK
[00:03:20] https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%98%D9%87:%D9%85%D8%B4%D8%A7%D8%B1%DA%A9%D8%AA%E2%80%8C%D9%87%D8%A7/Ladsgroup?uselang=en
[00:03:39] the last four edits was made with help of ORES
[00:04:52] o/
[00:10:09] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[00:15:09] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 289 seconds ago with 0 failures
[00:21:09] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host
[00:25:10] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:26:49] PROBLEM - configured eth on maps1002 is CRITICAL: Connection refused by host
[00:27:49] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host
[00:27:49] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host
[00:27:59] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:28:57] 06Operations: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10yuvipanda)
[00:29:50] RECOVERY - Disk space on maps1002 is OK: DISK OK
[00:29:50] RECOVERY - DPKG on maps1002 is OK: All packages OK
[00:31:00] RECOVERY - configured eth on maps1002 is OK: OK - interfaces up
[00:31:20] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host
[00:32:37] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host
[00:34:07] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:34:37] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host
[00:37:26] PROBLEM - MD RAID on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:40:37] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[00:45:27] RECOVERY - MD RAID on install2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[00:46:25] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2378083 (10Danny_B)
[00:47:51] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362379 (10Danny_B) >>! In T137224#2376481, @Danny_B wrote: > `/patch/` links have pretty similar equivalent in Diffu...
[00:55:33] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2378096 (10mmodell)
[00:55:56] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[01:02:12] 06Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#2378110 (10GWicke) Quick status update: We have since introduced per-entrypoint limits in the REST API. Initially, this is targeted at [uncacheable transforms](https://en.wikipedia.org/api/res...
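The "salt-minion processes ... PROCS OK: 1 process with regex args" results above are the output format of the stock Nagios check_procs plugin, executed on the monitored host via NRPE. A minimal sketch of an equivalent invocation (the plugin path and the 1:1 threshold are assumptions, not taken from this log):

    # exactly one matching process is OK; anything else is CRITICAL
    /usr/lib/nagios/plugins/check_procs -c 1:1 \
        --ereg-argument-array '^/usr/bin/python /usr/bin/salt-minion'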
[01:20:57] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:39:06] 06Operations, 10ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2378160 (10faidon)
[01:41:17] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:41:26] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:43:07] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.714 second response time
[01:46:06] PROBLEM - Disk space on elastic1012 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80305 MB (15% inode=99%)
[01:47:36] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:00:27] RECOVERY - Disk space on elastic1012 is OK: DISK OK
[02:08:26] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host
[02:08:47] PROBLEM - configured eth on maps1002 is CRITICAL: Connection refused by host
[02:11:37] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host
[02:12:26] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[02:12:47] RECOVERY - configured eth on maps1002 is OK: OK - interfaces up
[02:15:47] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures
[02:20:16] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host
[02:20:37] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.962 second response time
[02:24:17] RECOVERY - Disk space on maps1002 is OK: DISK OK
[02:24:37] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.579 second response time
[02:31:06] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:33:51] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 12m 14s)
[02:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:39:50] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jun 14 02:39:50 UTC 2016 (duration 5m 59s)
[02:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:54:30] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:58:29] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.011 second response time
[03:04:34] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:06:24] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.009 second response time
[03:27:44] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:40:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[03:42:15] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2378215 (10Antigng_) Labs replicas can't do that job, as revision tables are removed on such databases. Dumps are not updated such often.
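The elastic1012 alert above ("DISK CRITICAL - free space: /var/lib/elasticsearch 80305 MB (15% inode=99%)") is the classic check_disk output format. A hedged sketch of the kind of command that produces it (the warning and critical thresholds here are assumptions, not taken from this log):

    # warn below 20% free, go critical below 15% free, on one partition
    /usr/lib/nagios/plugins/check_disk -w 20% -c 15% -p /var/lib/elasticsearch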
[03:45:10] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 294 seconds ago with 0 failures
[03:56:01] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:02:11] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.195 second response time
[04:10:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[04:15:10] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 292 seconds ago with 0 failures
[04:15:40] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host
[04:16:00] PROBLEM - configured eth on maps1002 is CRITICAL: Connection refused by host
[04:18:01] RECOVERY - configured eth on maps1002 is OK: OK - interfaces up
[04:21:41] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[04:22:41] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:22:42] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host
[04:24:21] PROBLEM - dhclient process on maps1002 is CRITICAL: Connection refused by host
[04:24:22] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host
[04:26:50] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 5.442 second response time
[04:27:32] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host
[04:27:51] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host
[04:29:22] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host
[04:30:20] PROBLEM - configured eth on maps1002 is CRITICAL: Connection refused by host
[04:34:16] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:37:57] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[04:38:17] RECOVERY - DPKG on maps1002 is OK: All packages OK
[04:38:37] RECOVERY - Disk space on maps1002 is OK: DISK OK
[04:38:57] RECOVERY - dhclient process on maps1002 is OK: PROCS OK: 0 processes with command name dhclient
[04:38:58] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[04:39:36] RECOVERY - configured eth on maps1002 is OK: OK - interfaces up
[04:39:47] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:43:47] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 6.798 second response time
[04:56:16] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[04:56:58] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:57:59] (03PS1) 10KartikMistry: apertium-swe: Initial Debian packaging [debs/contenttranslation/apertium-swe] - 10https://gerrit.wikimedia.org/r/294244 (https://phabricator.wikimedia.org/T137767)
[05:02:28] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:04:19] (03PS1) 10KartikMistry: apertium-swe-nor: Initial Debian packaging [debs/contenttranslation/apertium-swe-nor] - 10https://gerrit.wikimedia.org/r/294245 (https://phabricator.wikimedia.org/T137767)
[05:27:06] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:28:20] (03PS1) 10Jdlrobson: Enable lazy loaded images on Ukranian and Farsi Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294247 (https://phabricator.wikimedia.org/T134003)
[05:32:27] 06Operations: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10mmodell) I like the aligned arrows but it doesn't need to be a lint error.
[05:33:46] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.998 second response time
[05:36:06] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 24.016 second response time
[05:45:47] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:46:50] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2378301 (10mmodell)
[05:48:06] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2378316 (10mmodell) This is the last thing blocking the completion of T130950 @fgiunchedi is this something you can help...
[05:49:57] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 8.626 second response time
[05:55:10] (03PS1) 10KartikMistry: apertium-swe-dan: Initial Debian packaging [debs/contenttranslation/apertium-swe-dan] - 10https://gerrit.wikimedia.org/r/294248 (https://phabricator.wikimedia.org/T137767)
[06:05:36] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:06:15] PROBLEM - Disk space on mw2238 is CRITICAL: Timeout while attempting connection
[06:06:15] PROBLEM - Disk space on mw2233 is CRITICAL: Timeout while attempting connection
[06:06:15] PROBLEM - Disk space on mw2234 is CRITICAL: Timeout while attempting connection
[06:06:15] PROBLEM - Disk space on mw2236 is CRITICAL: Timeout while attempting connection
[06:06:15] PROBLEM - Disk space on mw2239 is CRITICAL: Timeout while attempting connection
[06:06:16] PROBLEM - Disk space on mw2237 is CRITICAL: Timeout while attempting connection
[06:06:16] PROBLEM - Disk space on mw2240 is CRITICAL: Timeout while attempting connection
[06:06:17] PROBLEM - Disk space on mw2235 is CRITICAL: Timeout while attempting connection
[06:06:35] PROBLEM - MD RAID on mw2238 is CRITICAL: Timeout while attempting connection
[06:06:35] PROBLEM - MD RAID on mw2235 is CRITICAL: Timeout while attempting connection
[06:06:35] PROBLEM - MD RAID on mw2234 is CRITICAL: Timeout while attempting connection
[06:06:36] PROBLEM - MD RAID on mw2236 is CRITICAL: Timeout while attempting connection
[06:06:36] PROBLEM - MD RAID on mw2233 is CRITICAL: Timeout while attempting connection
[06:06:36] PROBLEM - MD RAID on mw2237 is CRITICAL: Timeout while attempting connection
[06:06:36] PROBLEM - MD RAID on mw2239 is CRITICAL: Timeout while attempting connection
[06:06:37] PROBLEM - MD RAID on mw2240 is CRITICAL: Timeout while attempting connection
[06:07:25] PROBLEM - configured eth on mw2238 is CRITICAL: Timeout while attempting connection
[06:07:25] PROBLEM - configured eth on mw2233 is CRITICAL: Timeout while attempting connection
[06:07:26] PROBLEM - configured eth on mw2235 is CRITICAL: Timeout while attempting connection
[06:07:26] PROBLEM - configured eth on mw2236 is CRITICAL: Timeout while attempting connection
[06:07:26] PROBLEM - configured eth on mw2237 is CRITICAL: Timeout while attempting connection
[06:07:26] PROBLEM - configured eth on mw2234 is CRITICAL: Timeout while attempting connection
[06:07:26] PROBLEM - configured eth on mw2240 is CRITICAL: Timeout while attempting connection
[06:07:27] PROBLEM - configured eth on mw2239 is CRITICAL: Timeout while attempting connection
[06:07:27] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.285 second response time
[06:07:35] PROBLEM - Apache HTTP on mw2233 is CRITICAL: Connection timed out
[06:07:35] PROBLEM - Apache HTTP on mw2235 is CRITICAL: Connection timed out
[06:07:35] PROBLEM - Apache HTTP on mw2236 is CRITICAL: Connection timed out
[06:07:35] PROBLEM - Apache HTTP on mw2238 is CRITICAL: Connection timed out
[06:07:35] PROBLEM - Apache HTTP on mw2240 is CRITICAL: Connection timed out
[06:07:36] PROBLEM - Apache HTTP on mw2239 is CRITICAL: Connection timed out
[06:07:36] PROBLEM - Apache HTTP on mw2237 is CRITICAL: Connection timed out
[06:07:37] PROBLEM - dhclient process on mw2234 is CRITICAL: Connection refused by host
[06:07:37] PROBLEM - dhclient process on mw2233 is CRITICAL: Connection refused by host
[06:07:45] PROBLEM - dhclient process on mw2238 is CRITICAL: Timeout while attempting connection
[06:07:45] PROBLEM - dhclient process on mw2236 is CRITICAL: Timeout while attempting connection
[06:07:45] PROBLEM - dhclient process on mw2235 is CRITICAL: Timeout while attempting connection
[06:07:45] PROBLEM - dhclient process on mw2240 is CRITICAL: Timeout while attempting connection
[06:07:45] PROBLEM - dhclient process on mw2237 is CRITICAL: Timeout while attempting connection
[06:07:46] PROBLEM - dhclient process on mw2239 is CRITICAL: Timeout while attempting connection
[06:08:05] PROBLEM - mediawiki-installation DSH group on mw2233 is CRITICAL: Host mw2233 is not in mediawiki-installation dsh group
[06:08:06] PROBLEM - mediawiki-installation DSH group on mw2234 is CRITICAL: Host mw2234 is not in mediawiki-installation dsh group
[06:08:06] PROBLEM - mediawiki-installation DSH group on mw2238 is CRITICAL: Host mw2238 is not in mediawiki-installation dsh group
[06:08:06] PROBLEM - mediawiki-installation DSH group on mw2239 is CRITICAL: Host mw2239 is not in mediawiki-installation dsh group
[06:08:06] PROBLEM - mediawiki-installation DSH group on mw2236 is CRITICAL: Host mw2236 is not in mediawiki-installation dsh group
[06:08:06] PROBLEM - mediawiki-installation DSH group on mw2237 is CRITICAL: Host mw2237 is not in mediawiki-installation dsh group
[06:08:07] PROBLEM - mediawiki-installation DSH group on mw2235 is CRITICAL: Host mw2235 is not in mediawiki-installation dsh group
[06:08:07] PROBLEM - mediawiki-installation DSH group on mw2240 is CRITICAL: Host mw2240 is not in mediawiki-installation dsh group
[06:08:26] PROBLEM - nutcracker port on mw2233 is CRITICAL: Connection refused by host
[06:08:26] PROBLEM - nutcracker port on mw2234 is CRITICAL: Connection refused by host
[06:08:35] PROBLEM - nutcracker port on mw2238 is CRITICAL: Timeout while attempting connection
[06:08:35] PROBLEM - nutcracker port on mw2237 is CRITICAL: Timeout while attempting connection
[06:08:35] PROBLEM - nutcracker port on mw2240 is CRITICAL: Timeout while attempting connection
[06:08:35] PROBLEM - nutcracker port on mw2235 is CRITICAL: Timeout while attempting connection
[06:08:35] PROBLEM - nutcracker port on mw2239 is CRITICAL: Timeout while attempting connection
[06:08:36] PROBLEM - nutcracker port on mw2236 is CRITICAL: Timeout while attempting connection
[06:08:36] PROBLEM - nutcracker process on mw2233 is CRITICAL: Connection refused by host
[06:08:37] PROBLEM - nutcracker process on mw2234 is CRITICAL: Connection refused by host
[06:08:47] PROBLEM - nutcracker process on mw2239 is CRITICAL: Timeout while attempting connection
[06:08:47] PROBLEM - nutcracker process on mw2237 is CRITICAL: Timeout while attempting connection
[06:08:47] PROBLEM - nutcracker process on mw2235 is CRITICAL: Timeout while attempting connection
[06:08:47] PROBLEM - nutcracker process on mw2240 is CRITICAL: Timeout while attempting connection
[06:08:47] PROBLEM - nutcracker process on mw2236 is CRITICAL: Timeout while attempting connection
[06:08:48] PROBLEM - nutcracker process on mw2238 is CRITICAL: Timeout while attempting connection
[06:08:56] PROBLEM - puppet last run on mw2233 is CRITICAL: Connection refused by host
[06:08:56] PROBLEM - puppet last run on mw2234 is CRITICAL: Connection refused by host
[06:09:05] PROBLEM - puppet last run on mw2238 is CRITICAL: Timeout while attempting connection
[06:09:05] PROBLEM - puppet last run on mw2237 is CRITICAL: Timeout while attempting connection
[06:09:06] PROBLEM - puppet last run on mw2236 is CRITICAL: Timeout while attempting connection
[06:09:06] PROBLEM - puppet last run on mw2235 is CRITICAL: Timeout while attempting connection
[06:09:06] PROBLEM - puppet last run on mw2240 is CRITICAL: Timeout while attempting connection
[06:09:06] PROBLEM - puppet last run on mw2239 is CRITICAL: Timeout while attempting connection
[06:09:15] PROBLEM - salt-minion processes on mw2233 is CRITICAL: Connection refused by host
[06:09:15] PROBLEM - salt-minion processes on mw2234 is CRITICAL: Connection refused by host
[06:09:25] PROBLEM - salt-minion processes on mw2235 is CRITICAL: Timeout while attempting connection
[06:09:25] PROBLEM - salt-minion processes on mw2238 is CRITICAL: Timeout while attempting connection
[06:09:25] PROBLEM - salt-minion processes on mw2239 is CRITICAL: Timeout while attempting connection
[06:09:25] PROBLEM - salt-minion processes on mw2236 is CRITICAL: Timeout while attempting connection
[06:09:25] PROBLEM - salt-minion processes on mw2237 is CRITICAL: Timeout while attempting connection
[06:09:26] PROBLEM - salt-minion processes on mw2240 is CRITICAL: Timeout while attempting connection
[06:09:30] (03PS1) 10KartikMistry: apertium-cat: Initial Debian packaging [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/294250 (https://phabricator.wikimedia.org/T137768)
[06:09:35] RECOVERY - Apache HTTP on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.073 second response time
[06:09:46] PROBLEM - Check size of conntrack table on mw2233 is CRITICAL: Connection refused by host
[06:09:56] PROBLEM - Check size of conntrack table on mw2237 is CRITICAL: Timeout while attempting connection
[06:09:56] PROBLEM - Check size of conntrack table on mw2238 is CRITICAL: Timeout while attempting connection
[06:09:56] PROBLEM - Check size of conntrack table on mw2239 is CRITICAL: Timeout while attempting connection
[06:09:56] PROBLEM - Check size of conntrack table on mw2240 is CRITICAL: Timeout while attempting connection
[06:09:56] PROBLEM - Check size of conntrack table on mw2236 is CRITICAL: Timeout while attempting connection
[06:09:57] PROBLEM - Check size of conntrack table on mw2235 is CRITICAL: Timeout while attempting connection
[06:10:06] PROBLEM - DPKG on mw2233 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[06:10:06] PROBLEM - DPKG on mw2234 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[06:10:15] PROBLEM - DPKG on mw2238 is CRITICAL: Timeout while attempting connection
[06:10:15] PROBLEM - DPKG on mw2239 is CRITICAL: Timeout while attempting connection
[06:10:15] PROBLEM - DPKG on mw2237 is CRITICAL: Timeout while attempting connection
[06:10:15] PROBLEM - DPKG on mw2235 is CRITICAL: Timeout while attempting connection
[06:10:15] PROBLEM - DPKG on mw2240 is CRITICAL: Timeout while attempting connection
[06:10:16] PROBLEM - DPKG on mw2236 is CRITICAL: Timeout while attempting connection
[06:10:16] RECOVERY - Disk space on mw2233 is OK: DISK OK
[06:10:17] RECOVERY - Disk space on mw2234 is OK: DISK OK
[06:10:26] RECOVERY - nutcracker port on mw2233 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[06:10:26] RECOVERY - nutcracker port on mw2234 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[06:10:36] RECOVERY - MD RAID on mw2233 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:10:36] RECOVERY - MD RAID on mw2234 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:10:38] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2378364 (10jcrespo)
[06:10:46] RECOVERY - nutcracker process on mw2233 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:10:46] RECOVERY - nutcracker process on mw2234 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:11:16] RECOVERY - salt-minion processes on mw2233 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:11:16] RECOVERY - salt-minion processes on mw2234 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:11:26] RECOVERY - configured eth on mw2234 is OK: OK - interfaces up
[06:11:26] RECOVERY - configured eth on mw2233 is OK: OK - interfaces up
[06:11:30] (03PS2) 10Muehlenhoff: Define ferm service dynamicproxy-api-http in role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/293733
[06:11:46] RECOVERY - dhclient process on mw2233 is OK: PROCS OK: 0 processes with command name dhclient
[06:11:46] RECOVERY - dhclient process on mw2234 is OK: PROCS OK: 0 processes with command name dhclient
[06:11:55] RECOVERY - Check size of conntrack table on mw2233 is OK: OK: nf_conntrack is 0 % full
[06:13:45] RECOVERY - Apache HTTP on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.075 second response time
[06:14:15] RECOVERY - DPKG on mw2233 is OK: All packages OK
[06:14:15] RECOVERY - DPKG on mw2234 is OK: All packages OK
[06:15:46] RECOVERY - Apache HTTP on mw2235 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.075 second response time
[06:15:46] RECOVERY - Apache HTTP on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.077 second response time
[06:15:56] RECOVERY - dhclient process on mw2237 is OK: PROCS OK: 0 processes with command name dhclient
[06:16:05] RECOVERY - Check size of conntrack table on mw2237 is OK: OK: nf_conntrack is 0 % full
[06:16:16] RECOVERY - DPKG on mw2235 is OK: All packages OK
[06:16:26] RECOVERY - Disk space on mw2235 is OK: DISK OK
[06:16:26] RECOVERY - Disk space on mw2237 is OK: DISK OK
[06:16:45] RECOVERY - nutcracker port on mw2237 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[06:16:46] RECOVERY - nutcracker port on mw2235 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[06:16:46] RECOVERY - nutcracker port on mw2236 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[06:16:55] RECOVERY - MD RAID on mw2235 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:16:55] RECOVERY - MD RAID on mw2236 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:16:55] RECOVERY - MD RAID on mw2237 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:16:56] RECOVERY - nutcracker process on mw2236 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:16:56] RECOVERY - nutcracker process on mw2235 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:16:56] RECOVERY - nutcracker process on mw2237 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:17:36] RECOVERY - salt-minion processes on mw2235 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:17:36] RECOVERY - salt-minion processes on mw2237 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:17:36] RECOVERY - salt-minion processes on mw2236 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:17:45] RECOVERY - configured eth on mw2237 is OK: OK - interfaces up
[06:17:45] RECOVERY - configured eth on mw2236 is OK: OK - interfaces up
[06:17:45] RECOVERY - configured eth on mw2235 is OK: OK - interfaces up
[06:17:57] RECOVERY - dhclient process on mw2235 is OK: PROCS OK: 0 processes with command name dhclient
[06:18:05] RECOVERY - dhclient process on mw2236 is OK: PROCS OK: 0 processes with command name dhclient
[06:18:06] RECOVERY - Check size of conntrack table on mw2235 is OK: OK: nf_conntrack is 0 % full
[06:18:06] RECOVERY - Check size of conntrack table on mw2236 is OK: OK: nf_conntrack is 0 % full
[06:18:25] RECOVERY - DPKG on mw2237 is OK: All packages OK
[06:18:27] RECOVERY - Disk space on mw2236 is OK: DISK OK
[06:20:26] RECOVERY - DPKG on mw2236 is OK: All packages OK
[06:23:08] (03PS1) 10KartikMistry: apertium-fra: Initial Debian packaging [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/294252 (https://phabricator.wikimedia.org/T137768)
[06:23:10] (03PS1) 10Jcrespo: Pool new s1 servers db1080, db1083, db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294253 (https://phabricator.wikimedia.org/T133398)
[06:24:06] RECOVERY - Apache HTTP on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.078 second response time
[06:24:56] (03PS2) 10Jcrespo: Pool new s1 servers db1080, db1083, db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294253 (https://phabricator.wikimedia.org/T133398)
[06:26:26] RECOVERY - dhclient process on mw2239 is OK: PROCS OK: 0 processes with command name dhclient
[06:26:27] RECOVERY - Check size of conntrack table on mw2239 is OK: OK: nf_conntrack is 0 % full
[06:26:41] (03CR) 10Jcrespo: [C: 032] Pool new s1 servers db1080, db1083, db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294253 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo)
[06:26:55] RECOVERY - Disk space on mw2239 is OK: DISK OK
[06:27:07] RECOVERY - nutcracker port on mw2239 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[06:27:16] RECOVERY - MD RAID on mw2239 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:27:26] RECOVERY - nutcracker process on mw2239 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:27:45] PROBLEM - configured eth on maps1002 is CRITICAL: Connection refused by host
[06:27:48] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jan Dittrich - https://phabricator.wikimedia.org/T136560#2378416 (10Jan_Dittrich) Login to graphana works. Thanks.
[06:28:06] RECOVERY - salt-minion processes on mw2239 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:28:16] RECOVERY - configured eth on mw2239 is OK: OK - interfaces up
[06:28:26] RECOVERY - Apache HTTP on mw2240 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.075 second response time
[06:28:56] RECOVERY - DPKG on mw2239 is OK: All packages OK
[06:29:26] RECOVERY - MD RAID on mw2240 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:29:36] RECOVERY - nutcracker process on mw2240 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[06:29:45] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1052, db1080, db1083, db1089 (duration: 01m 31s)
[06:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:29:55] RECOVERY - configured eth on maps1002 is OK: OK - interfaces up
[06:29:56] PROBLEM - puppet last run on mw2233 is CRITICAL: CRITICAL: Puppet has 8 failures
[06:29:56] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 8 failures
[06:30:16] RECOVERY - salt-minion processes on mw2240 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:30:27] PROBLEM - Disk space on maps1002 is CRITICAL: Connection refused by host
[06:30:27] RECOVERY - configured eth on mw2240 is OK: OK - interfaces up
[06:30:27] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host
[06:30:36] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host
[06:30:46] RECOVERY - dhclient process on mw2240 is OK: PROCS OK: 0 processes with command name dhclient
[06:30:56] RECOVERY - Check size of conntrack table on mw2240 is OK: OK: nf_conntrack is 0 % full
[06:31:17] RECOVERY - Disk space on mw2240 is OK: DISK OK
[06:31:36] RECOVERY - nutcracker port on mw2240 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[06:31:45] PROBLEM - dhclient process on maps1002 is CRITICAL: Connection refused by host
[06:31:46] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:46] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:56] RECOVERY - DPKG on mw2240 is OK: All packages OK
[06:33:05] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail
[06:33:36] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:17] RECOVERY - dhclient process on maps1002 is OK: PROCS OK: 0 processes with command name dhclient
[06:34:25] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet has 8 failures
[06:35:06] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host
[06:35:06] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:26] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:36] 06Operations, 10Traffic, 06Wikipedia-iOS-App-Product-Backlog: Wikipedia app hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2378436 (10JMinor) p:05Normal>03Triage
[06:35:37] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 8 failures
[06:37:06] RECOVERY - DPKG on maps1002 is OK: All packages OK
[06:37:59] (03PS1) 10Urbanecm: Add autopatrolled group in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294254 (https://phabricator.wikimedia.org/T130808)
[06:38:55] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host
[06:39:46] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[06:40:06] PROBLEM - Restbase root url on restbase2003 is CRITICAL: Connection refused
[06:40:56] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.124, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[06:41:35] RECOVERY - Disk space on maps1002 is OK: DISK OK
[06:42:05] PROBLEM - Apache HTTP on mw2236 is CRITICAL: Connection refused
[06:42:36] PROBLEM - dhclient process on maps1002 is CRITICAL: Connection refused by host
[06:43:05] <_joe_> !log rebooting mw2228
[06:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:43:16] PROBLEM - NTP on mw2240 is CRITICAL: NTP CRITICAL: Offset unknown
[06:43:25] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host
[06:43:46] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Puppet has 9 failures
[06:44:36] RECOVERY - dhclient process on maps1002 is OK: PROCS OK: 0 processes with command name dhclient
[06:45:07] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:45:25] RECOVERY - NTP on mw2240 is OK: NTP OK: Offset 0.001886606216 secs
[06:45:26] RECOVERY - DPKG on maps1002 is OK: All packages OK
[06:46:05] PROBLEM - NTP on mw2238 is CRITICAL: NTP CRITICAL: No response from NTP server
[06:47:23] <_joe_> !log rebooting mw2228
[06:48:37] PROBLEM - Apache HTTP on mw2233 is CRITICAL: Connection refused
[06:50:06] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:53:35] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:54:26] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: systemd compatibility [puppet] - 10https://gerrit.wikimedia.org/r/294255
[06:55:06] PROBLEM - Apache HTTP on mw2237 is CRITICAL: Connection refused
[06:55:25] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:55:26] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.295 second response time
[06:55:41] (03CR) 10jenkins-bot: [V: 04-1] mediawiki::jobrunner: systemd compatibility [puppet] - 10https://gerrit.wikimedia.org/r/294255 (owner: 10Giuseppe Lavagetto)
[06:56:17] <_joe_> whats wrong with you, jenkins>
[06:56:27] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:46] PROBLEM - Apache HTTP on mw2234 is CRITICAL: Connection refused
[06:57:47] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
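The restbase2003 "endpoints health" failure above ([06:40:56]) spells out exactly what the checker requests: /en.wikipedia.org/v1/?spec on 10.192.32.124:7231. That makes the probe easy to reproduce by hand; a sketch using the host and port taken from the error text:

    # same request the health check makes; a healthy node returns the API spec
    curl -sS 'http://10.192.32.124:7231/en.wikipedia.org/v1/?spec'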
[06:58:15] RECOVERY - Apache HTTP on mw2238 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time
[06:58:16] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:57] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:57] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:27] RECOVERY - Restbase root url on restbase2003 is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.167 second response time
[07:01:55] RECOVERY - nutcracker port on mw2238 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[07:01:55] RECOVERY - dhclient process on mw2238 is OK: PROCS OK: 0 processes with command name dhclient
[07:02:40] RECOVERY - configured eth on mw2238 is OK: OK - interfaces up
[07:02:40] RECOVERY - Disk space on mw2238 is OK: DISK OK
[07:02:51] RECOVERY - MD RAID on mw2238 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:03:19] (03PS1) 10Jcrespo: Depool db1024, increase weight of db1080, db1083 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294258 (https://phabricator.wikimedia.org/T133398)
[07:03:30] RECOVERY - nutcracker process on mw2238 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[07:03:31] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[07:03:52] RECOVERY - salt-minion processes on mw2238 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:04:11] (03PS2) 10Jcrespo: Depool db1024, increase weight of db1082, db1087 and db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294258 (https://phabricator.wikimedia.org/T133398)
[07:05:31] RECOVERY - Check size of conntrack table on mw2238 is OK: OK: nf_conntrack is 0 % full
[07:05:35] <_joe_> !log rolling reboot of mw2233-40
[07:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:05:41] RECOVERY - DPKG on mw2238 is OK: All packages OK
[07:06:20] PROBLEM - HHVM rendering on mw1156 is CRITICAL: Connection timed out
[07:06:36] (03CR) 10Jcrespo: [C: 032] Depool db1024, increase weight of db1082, db1087 and db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294258 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo)
[07:06:51] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:08:01] PROBLEM - HHVM processes on mw1156 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:32] PROBLEM - dhclient process on mw1156 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:08:40] PROBLEM - SSH on mw1156 is CRITICAL: Connection timed out
[07:08:50] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[07:09:11] PROBLEM - nutcracker process on mw1156 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:09:20] PROBLEM - Check size of conntrack table on mw1156 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
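The db1024 depool merged above at [07:06:36] only takes effect once the edited wmf-config/db-eqiad.php is synced out from the deployment host; the "!log jynus@tin Synchronized wmf-config/db-eqiad.php: ..." entries in this log are the recorded output of that step. A sketch, assuming the scap-era sync-file wrapper in use on tin at the time:

    # run on tin after merging the mediawiki-config change; the message is logged to the SAL
    sync-file wmf-config/db-eqiad.php 'Depool db1024, increase weight of db1082, db1087 and db1092'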
[07:09:21] PROBLEM - nutcracker port on mw1156 is CRITICAL: Timeout while attempting connection
[07:09:21] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:09:31] RECOVERY - Apache HTTP on mw2234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.261 second response time
[07:09:32] (03PS1) 10KartikMistry: apertium-cy-en: Rebuilt for Jessie [debs/contenttranslation/apertium-cy-en] - 10https://gerrit.wikimedia.org/r/294260 (https://phabricator.wikimedia.org/T107306)
[07:09:41] RECOVERY - Apache HTTP on mw2236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.280 second response time
[07:10:01] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:10:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[07:10:21] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:10:30] RECOVERY - Apache HTTP on mw2233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.492 second response time
[07:10:50] RECOVERY - Apache HTTP on mw2237 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.211 second response time
[07:11:27] I think I brought down mw1156 on scap
[07:11:50] PROBLEM - Host mw1156 is DOWN: PING CRITICAL - Packet loss = 100%
[07:13:11] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail
[07:15:10] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 286 seconds ago with 0 failures
[07:15:16] <_joe_> jynus: no it's an API server OOMing
[07:15:16] [7750662.329178] BUG: soft lockup - CPU#8 stuck for 22s! [hhvm:28371]
[07:15:27] I see now
[07:16:10] but scap timeout is > 10 minutes?
[07:17:11] RECOVERY - NTP on mw2238 is OK: NTP OK: Offset -0.005183577538 secs
[07:18:02] PROBLEM - Apache HTTP on mw2240 is CRITICAL: Connection refused
[07:18:14] <_joe_> jynus: no idea
[07:18:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1024, increase weight of db1082, db1087 and db1092 (duration: 10m 50s)
[07:18:53] 06Operations, 10DBA, 10Wikidata, 07Performance: EntityUsageTable::getUsedEntityIdStrings query on wbc_entity_usage table is sometimes fast, sometimes slow - https://phabricator.wikimedia.org/T116404#2378551 (10adrianheine)
[07:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:19:12] !log powercycling mw1156, could not regain control after OOM
[07:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:21:01] RECOVERY - dhclient process on mw1156 is OK: PROCS OK: 0 processes with command name dhclient
[07:21:10] RECOVERY - SSH on mw1156 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[07:21:11] RECOVERY - Host mw1156 is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms
[07:21:30] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.130 second response time
[07:21:41] RECOVERY - nutcracker process on mw1156 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker
[07:21:42] RECOVERY - Check size of conntrack table on mw1156 is OK: OK: nf_conntrack is 0 % full
[07:21:50] RECOVERY - nutcracker port on mw1156 is OK: TCP OK - 0.000 second response time on port 11212
[07:22:11] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 8 failures
[07:22:40] doing a pull now
[07:22:40] RECOVERY - HHVM processes on mw1156 is OK: PROCS OK: 11 processes with command name hhvm
[07:25:01] RECOVERY - HHVM rendering on mw1156 is OK: HTTP OK: HTTP/1.1 200 OK - 67474 bytes in 0.348 second response time
[07:25:46] (03PS5) 10WMDE-leszek: Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore)
[07:26:48] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2378596 (10Arrbee)
[07:28:40] (03CR) 10WMDE-leszek: "Rebased PS4 so this could be merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore)
[07:37:03] (03PS1) 10Giuseppe Lavagetto: nutcracker: fix defaults file for systemd [puppet] - 10https://gerrit.wikimedia.org/r/294263
[07:40:58] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:41:45] 06Operations, 10Traffic, 07HTTPS: xx.wikipedia.com returns a certificate error - https://phabricator.wikimedia.org/T137779#2378679 (10cookies52)
[07:44:08] PROBLEM - Apache HTTP on mw2238 is CRITICAL: Connection refused
[07:45:03] 06Operations, 10Traffic, 10domains, 07HTTPS: xx.wikipedia.com returns a certificate error - https://phabricator.wikimedia.org/T137779#2378687 (10Peachey88)
[07:52:27] RECOVERY - Apache HTTP on mw2238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.130 second response time
[07:54:18] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host
[07:55:59] (03PS1) 10KartikMistry: apertium-en-ca: New upstream release and Jessie build [debs/contenttranslation/apertium-en-ca] - 10https://gerrit.wikimedia.org/r/294264 (https://phabricator.wikimedia.org/T107306)
[07:58:52] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/3108/mw1017.eqiad.wmnet/ does the right thing." [puppet] - 10https://gerrit.wikimedia.org/r/294263 (owner: 10Giuseppe Lavagetto)
[08:01:27] PROBLEM - configured eth on maps1002 is CRITICAL: Connection refused by host
[08:02:29] PROBLEM - DPKG on maps1002 is CRITICAL: Connection refused by host
[08:02:48] PROBLEM - HHVM rendering on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:02:58] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures
[08:03:37] PROBLEM - salt-minion processes on mw1154 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:03:38] PROBLEM - dhclient process on mw1154 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:04:18] PROBLEM - nutcracker process on mw1154 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:04:27] PROBLEM - HHVM processes on mw1154 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:06:08] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host
[08:06:38] RECOVERY - DPKG on maps1002 is OK: All packages OK
[08:07:37] PROBLEM - salt-minion processes on maps1002 is CRITICAL: Connection refused by host
[08:08:08] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[08:09:28] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:09:36] (03PS1) 10Jcrespo: Depool db1040 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294265 (https://phabricator.wikimedia.org/T133398)
[08:10:28] PROBLEM - nutcracker process on mw1154 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:10:38] (03PS2) 10Jcrespo: Depool db1040 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294265 (https://phabricator.wikimedia.org/T133398)
[08:11:48] RECOVERY - salt-minion processes on maps1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:11:49] RECOVERY - configured eth on maps1002 is OK: OK - interfaces up
[08:12:09] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:12:52] <_joe_> I don't get the reason of those alarms on maps1002
[08:12:57] <_joe_> now looking at mw1154
[08:13:40] I assume a large spike in activity?
[08:14:00] <_joe_> jynus: that server is completely idle
[08:14:05] oh
[08:14:14] then it has to be network
[08:16:11] <_joe_> mw1154 is an imagescaler, so I am investigating what's up there
[08:16:47] thank you
[08:19:22] (03CR) 10Jcrespo: [C: 032] Depool db1040 for cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294265 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo)
[08:23:23] <_joe_> !log powercycling mw1154, unresponsive
[08:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:23:48] PROBLEM - puppet last run on mw1154 is CRITICAL: Timeout while attempting connection
[08:26:28] RECOVERY - salt-minion processes on mw1154 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:26:38] RECOVERY - dhclient process on mw1154 is OK: PROCS OK: 0 processes with command name dhclient
[08:26:48] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 2.706 second response time
[08:27:17] RECOVERY - nutcracker process on mw1154 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker
[08:27:27] RECOVERY - HHVM processes on mw1154 is OK: PROCS OK: 11 processes with command name hhvm
[08:27:37] 06Operations: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10Joe) The arrow alignment is one of the basic puppet style guidelines everyone on the net seems to agree on. Since we want our puppet modules to be reused by others, we can't get rid of the arrow allig...
[08:27:58] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:53] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1040 for cloning (duration: 00m 32s)
[08:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:30:08] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 8.661 second response time
[08:30:08] RECOVERY - HHVM rendering on mw1154 is OK: HTTP OK: HTTP/1.1 200 OK - 67483 bytes in 1.939 second response time
[08:34:34] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:41:13] PROBLEM - MD RAID on maps1002 is CRITICAL: Connection refused by host
[08:41:24] PROBLEM - configured eth on maps1002 is CRITICAL: Connection refused by host
[08:43:14] RECOVERY - MD RAID on maps1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[08:49:44] RECOVERY - configured eth on maps1002 is OK: OK - interfaces up
[08:53:24] PROBLEM - dhclient process on maps1002 is CRITICAL: Connection refused by host
[08:54:00] * gehel is checking maps1002...
[08:54:13] PROBLEM - puppet last run on maps1002 is CRITICAL: Connection refused by host
[08:55:34] RECOVERY - dhclient process on maps1002 is OK: PROCS OK: 0 processes with command name dhclient
[08:57:23] maps100? are not yet in service, silencing them...
[08:57:25] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:58:01] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2378794 (10fgiunchedi) @mmodell sure I can help with it! I see the `debian` directory is on the deployment repo, though gen...
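"Powercycling" an unresponsive host such as mw1154 or mw1156 is done out of band through its management controller rather than over SSH. A hedged sketch using ipmitool (the .mgmt hostname convention and the user are assumptions, and credential handling is omitted):

    # force a power cycle via the machine's IPMI management interface
    ipmitool -I lanplus -H mw1154.mgmt.eqiad.wmnet -U root chassis power cycle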
[08:58:34] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[08:58:36] 06Operations, 07Puppet: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378795 (10Peachey88)
[08:59:03] !log stopping db1040 for cloning to new s4 hosts
[08:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:04:27] !log gallium: manually removing cron entry zuul_repack from user zuul. Causes cron spam due to zuul merger no more being on gallium T137418
[09:04:28] T137418: Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418
[09:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:06:04] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1040.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1040.eqiad.wmnet (111 Connection refused)
[09:06:14] that is normal
[09:06:28] I will ack it
[09:06:54] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Remove zuul-merger from gallium - https://phabricator.wikimedia.org/T137418#2378806 (10hashar) zuul@gallium.wikimedia still had a cron entry: ``` # Puppet Name: zuul_repack PATH=/usr/bin:/bin:/usr/sbin:/sbin 7 4 *...
[09:10:24] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2378822 (10fgiunchedi) we could do that too, namely install all 6x systems in row D and expand swift there. If row D is generally underutilized let's go with that instead, thanks!
[09:20:22] (03PS4) 10Filippo Giunchedi: swift: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291173 (owner: 10BryanDavis)
[09:20:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: Fix PEP8 violations [puppet] - 10https://gerrit.wikimedia.org/r/291173 (owner: 10BryanDavis)
[09:27:55] !log roll-restart swift proxy in codfw and eqiad
[09:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:31:24] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:39:45] (03PS1) 10Catrope: Add nonecho.dblist and echo.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294269 (https://phabricator.wikimedia.org/T137771)
[09:40:05] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 07WorkType-NewFunctionality: Phase out operations-puppet-pep8 Jenkins job and tools/puppet_pep8.py - https://phabricator.wikimedia.org/T114887#2378883 (10hashar) @bd808 has sent a long serie of patches that bring up operations/puppe...
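T137763, debated at several points in this log, concerns puppet-lint's arrow_alignment check. puppet-lint can rewrite that class of style problem itself instead of merely flagging it, which is the idea behind the "[STRAWMAN] let puppet-lint fix arrow alignment" patch further down; a sketch of both options (the manifest path is a placeholder):

    # auto-correct fixable style problems, arrow alignment included
    puppet-lint --fix manifests/
    # or silence just this one check instead
    puppet-lint --no-arrow_alignment-check manifests/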
[09:40:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[09:40:57] (03Abandoned) 10Hashar: tox entry point to run pep8==1.4.6 [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar)
[09:44:13] (03PS1) 10Hashar: Get rid of .pep8 files [puppet] - 10https://gerrit.wikimedia.org/r/294270 (https://phabricator.wikimedia.org/T114887)
[09:44:16] (03PS1) 10Catrope: Only run processEchoEmailBatch.php on Echo-enabled wikis [puppet] - 10https://gerrit.wikimedia.org/r/294271 (https://phabricator.wikimedia.org/T137771)
[09:45:13] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 296 seconds ago with 0 failures
[09:46:08] hey all, I'm looking at your kafka debian packaging, and it seems that the package depends (and build-depends) on libcglib-java, which is not available in jessie
[09:47:57] https://packages.debian.org/jessie/libcglib3-java ?
[09:48:12] looks like they've added a number
[09:48:21] (03CR) 10Jcrespo: [C: 031] Only run processEchoEmailBatch.php on Echo-enabled wikis [puppet] - 10https://gerrit.wikimedia.org/r/294271 (https://phabricator.wikimedia.org/T137771) (owner: 10Catrope)
[09:51:17] (03PS1) 10Giuseppe Lavagetto: mediawiki: add other servers to the dsh list [puppet] - 10https://gerrit.wikimedia.org/r/294273
[09:51:19] (03PS1) 10Giuseppe Lavagetto: conftool-data: add newly imaged hosts with jessie [puppet] - 10https://gerrit.wikimedia.org/r/294274
[09:51:41] 06Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T137785#2378894 (10fgiunchedi) 03NEW
[09:52:35] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/294270 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar)
[09:53:23] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:54:50] Reedy: right; usually when the maintainers add a number the new version is not api-compatible, but in that case the package seems to build just fine
[09:56:43] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[09:57:34] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.311 second response time
[09:58:52] 06Operations, 06Discovery, 06Labs, 10hardware-requests, 03Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2378914 (10mark)
[09:59:02] 07Blocked-on-Operations, 06Operations, 10hardware-requests: Evaluate replacing SATA disks on ganeti100X.eqiad.wmnet with SSDs - https://phabricator.wikimedia.org/T132679#2378917 (10mark)
[10:04:31] RECOVERY - MegaRAID on ms-be2003 is OK: OK: optimal, 13 logical, 13 physical
[10:07:14] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/294270 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar)
[10:10:10] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures
[10:10:48] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: add other servers to the dsh list [puppet] - 10https://gerrit.wikimedia.org/r/294273 (owner: 10Giuseppe Lavagetto)
[10:10:56] (03CR) 10Hashar: [C: 031] "That was solely for running pep8 in each directories. No more needed since puppet.git is now fully flake8 compliant and one would typicall" [puppet] - 10https://gerrit.wikimedia.org/r/294270 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar)
[10:13:30] (03PS2) 10Giuseppe Lavagetto: conftool-data: add newly imaged hosts with jessie [puppet] - 10https://gerrit.wikimedia.org/r/294274
[10:14:04] 06Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review, 15User-bd808, 07WorkType-NewFunctionality: Phase out operations-puppet-pep8 Jenkins job and tools/puppet_pep8.py - https://phabricator.wikimedia.org/T114887#2378998 (10hashar) 05Open>03Resolved a:03bd808 I have: * removed th...
[10:15:10] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 298 seconds ago with 0 failures
[10:16:04] 06Operations, 10Analytics-Cluster, 10Packaging: libcglib3-java replaces libcglib-java in Jessie - https://phabricator.wikimedia.org/T137791#2379016 (10Reedy)
[10:16:14] olasd: ^ Filed a task
[10:16:29] 06Operations, 07Puppet: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10fgiunchedi) [while looking at the bikeshed] for me the main annoyance is doing the work twice, namely paying attention to arrow alignment while writing the code and fix it manually afterwar...
[10:16:43] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: add newly imaged hosts with jessie [puppet] - 10https://gerrit.wikimedia.org/r/294274 (owner: 10Giuseppe Lavagetto)
[10:17:42] Reedy: thanks a bunch
[10:17:48] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint: Send logs to logstash for maps services (katotherian, tilerator, tileratorui) - https://phabricator.wikimedia.org/T137618#2379035 (10Gehel) I now see logs in logstash. I'm completely unsure why I did not before (I might have just been blind). I wa...
[10:32:21] 07Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 03Collab-Team-2016-Apr-Jun-Q4: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2379060 (10ArielGlenn) Uh, this is done, insofar as they show up on dumps.wm.org as files to be...
[10:34:17] labsdb1003 running low on space
[10:34:57] and also 3GB of swap
[10:43:13] 06Operations, 07Puppet: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10akosiaris) I definitely appreciate aligned arrows when reading puppet code. I also use vim, so with the expection of nested arrows, where I 've more than once had to instruct puppet-lint to...
[10:43:53] (03CR) 10Alexandros Kosiaris: [C: 032] Get rid of .pep8 files [puppet] - 10https://gerrit.wikimedia.org/r/294270 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar)
[10:44:02] (03PS2) 10Alexandros Kosiaris: Get rid of .pep8 files [puppet] - 10https://gerrit.wikimedia.org/r/294270 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar)
[10:48:09] (03PS1) 10Filippo Giunchedi: [STRAWMAN] let puppet-lint fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/294282 (https://phabricator.wikimedia.org/T137763)
[10:48:40] PROBLEM - puppet last run on mw2082 is CRITICAL: CRITICAL: puppet fail
[10:49:22] (03CR) 10jenkins-bot: [V: 04-1] [STRAWMAN] let puppet-lint fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/294282 (https://phabricator.wikimedia.org/T137763) (owner: 10Filippo Giunchedi)
[10:50:52] the irony
[10:52:15] !log oblivian@palladium conftool action : set/weight=30; selector: cluster=appserver,dc=eqiad,name=mw12[67].*
[10:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:52:51] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[10:53:24] <_joe_> godog: indeed
[10:56:21] <_joe_> !log pooling the new jessie appservers, mw1263-71
[10:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:57:10] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy
[10:58:50] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:00:11] 06Operations, 10ops-eqiad, 13Patch-For-Review: Rack and Set up new application servers mw1261-1283 - https://phabricator.wikimedia.org/T133798#2379148 (10Joe) `mw1261-mw1271` have been installed, properly set up and pooled into the appservers cluster.
[11:01:01] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.574 second response time
[11:01:15] _joe_ starting the reimage of mw1272.eqiad.wmnet and mw1273.eqiad.wmnet
[11:02:11] ACKNOWLEDGEMENT - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi https://phabricator.wikimedia.org/T137785
[11:02:19] (03PS3) 10Gehel: Enable 'has_spec' on Kartotherian service. [puppet] - 10https://gerrit.wikimedia.org/r/294028 (https://phabricator.wikimedia.org/T137617)
[11:02:59] <_joe_> elukey: ok cool
[11:04:00] <_joe_> !log pooling all the new codfw appservers that have been installed - mw2215-mw2240 (T135466)
[11:04:01] T135466: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466
[11:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:06:21] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[11:06:56] (03CR) 10Gehel: [C: 032] "Looks good and puppet compiler aggrees: https://puppet-compiler.wmflabs.org/3109/" [puppet] - 10https://gerrit.wikimedia.org/r/294028 (https://phabricator.wikimedia.org/T137617) (owner: 10Gehel)
[11:08:31] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[11:08:55] (03PS1) 10Gehel: Move service logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618)
[11:09:29] (03CR) 10jenkins-bot: [V: 04-1] Move service logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618) (owner: 10Gehel)
[11:10:37] (03PS2) 10Filippo Giunchedi: [STRAWMAN] let puppet-lint fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/294282 (https://phabricator.wikimedia.org/T137763)
[11:11:01] (03PS2) 10Gehel: Move service logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618)
[11:13:40] !log T134242 install qemu-system-common, qemu-system-x86 1:2.5+dfsg-4~bpo8+1 from jessie-backports on ganeti200{1,2,3,4,5,6}
[11:13:40] RECOVERY - mediawiki-installation DSH group on mw2236 is OK: OK
[11:13:41] RECOVERY - mediawiki-installation DSH group on mw2234 is OK: OK
[11:13:41] RECOVERY - mediawiki-installation DSH group on mw2233 is OK: OK
[11:13:41] RECOVERY - mediawiki-installation DSH group on mw2237 is OK: OK
[11:13:41] RECOVERY - mediawiki-installation DSH group on mw2238 is OK: OK
[11:13:41] RECOVERY - mediawiki-installation DSH group on mw2235 is OK: OK
[11:13:41] RECOVERY - mediawiki-installation DSH group on mw2240 is OK: OK
[11:13:41] T134242: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242
[11:13:42] RECOVERY - mediawiki-installation DSH group on mw2239 is OK: OK
[11:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:14:01] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:14:38] (03CR) 10Mobrovac: [C: 031] "LGTM, but I see that service::node doesn't ensure service::configuration::log_dir actually exists before creating the child dir, so please" [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618) (owner: 10Gehel)
[11:15:32] !log T134242 rebooting alsafi.wikimedia.org hassaleh.codfw.wmnet kraz.wikimedia.org mx2001.wikimedia.org planet2001.codfw.wmnet pollux.wikimedia.org pybal-test2001.codfw.wmnet pybal-test2002.codfw.wmnet pybal-test2003.codfw.wmnet for qemu-kvm upgrade
[11:15:33] T134242: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242
[11:16:41] RECOVERY - puppet last run on mw2082 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[11:17:20]
[puppet] - 10https://gerrit.wikimedia.org/r/294028 (https://phabricator.wikimedia.org/T137617) [11:02:59] <_joe_> elukey: ok cool [11:04:00] <_joe_> !log pooling all the new codfw appservers that have been installed - mw2215-mw2240 (T135466) [11:04:01] T135466: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466 [11:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:21] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:06:56] (03CR) 10Gehel: [C: 032] "Looks good and puppet compiler aggrees: https://puppet-compiler.wmflabs.org/3109/" [puppet] - 10https://gerrit.wikimedia.org/r/294028 (https://phabricator.wikimedia.org/T137617) (owner: 10Gehel) [11:08:31] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [11:08:55] (03PS1) 10Gehel: Move service logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618) [11:09:29] (03CR) 10jenkins-bot: [V: 04-1] Move service logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618) (owner: 10Gehel) [11:10:37] (03PS2) 10Filippo Giunchedi: [STRAWMAN] let puppet-lint fix arrow alignment [puppet] - 10https://gerrit.wikimedia.org/r/294282 (https://phabricator.wikimedia.org/T137763) [11:11:01] (03PS2) 10Gehel: Move service logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618) [11:13:40] !log T134242 install qemu-system-common, qemu-system-x86 1:2.5+dfsg-4~bpo8+1 from jessie-backports on ganeti200{1,2,3,4,5,6} [11:13:40] RECOVERY - mediawiki-installation DSH group on mw2236 is OK: OK [11:13:41] RECOVERY - mediawiki-installation DSH group on mw2234 is OK: OK [11:13:41] RECOVERY - mediawiki-installation DSH group on mw2233 is OK: OK [11:13:41] RECOVERY - mediawiki-installation DSH group on mw2237 is OK: OK [11:13:41] RECOVERY - mediawiki-installation DSH group on mw2238 is OK: OK [11:13:41] RECOVERY - mediawiki-installation DSH group on mw2235 is OK: OK [11:13:41] RECOVERY - mediawiki-installation DSH group on mw2240 is OK: OK [11:13:41] T134242: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242 [11:13:42] RECOVERY - mediawiki-installation DSH group on mw2239 is OK: OK [11:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:01] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [11:14:38] (03CR) 10Mobrovac: [C: 031] "LGTM, but I see that service::node doesn't ensure service::configuration::log_dir actually exists before creating the child dir, so please" [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618) (owner: 10Gehel) [11:15:32] !log T134242 rebooting alsafi.wikimedia.org hassaleh.codfw.wmnet kraz.wikimedia.org mx2001.wikimedia.org planet2001.codfw.wmnet pollux.wikimedia.org pybal-test2001.codfw.wmnet pybal-test2002.codfw.wmnet pybal-test2003.codfw.wmnet for qemu-kvm upgrade [11:15:33] T134242: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242 [11:16:41] RECOVERY - puppet last run on mw2082 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [11:17:20] 
RECOVERY - puppet last run on mw2240 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:17:32] RECOVERY - Apache HTTP on mw2240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.539 second response time [11:17:51] (03CR) 10Gehel: [C: 032] Move service logs to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294285 (https://phabricator.wikimedia.org/T137618) (owner: 10Gehel) [11:21:29] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2379191 (10akosiaris) All of `codfw` VMs have been upgraded to qemu 2.5. I 'll wait a few more days for any problems to manifest and then do `eqiad`. [11:22:09] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2379196 (10akosiaris) a:03akosiaris [11:28:05] (03PS1) 10Gehel: Revert "Move service logs to /srv/log" [puppet] - 10https://gerrit.wikimedia.org/r/294286 (https://phabricator.wikimedia.org/T137618) [11:29:35] (03CR) 10Gehel: [C: 032] Revert "Move service logs to /srv/log" [puppet] - 10https://gerrit.wikimedia.org/r/294286 (https://phabricator.wikimedia.org/T137618) (owner: 10Gehel) [11:30:05] !log scb disabling puppet for 10 mins or so to keep change-prop down [11:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:36:57] 06Operations, 10cassandra: 1000+ keyspace metrics you didn't see coming - https://phabricator.wikimedia.org/T137304#2379243 (10fgiunchedi) no problem, I see some metrics are still being updated though, e.g. ``` 3436550197 304 -rw-r--r-- 1 _graphite _graphite 309088 Jun 14 11:25 /var/lib/carbon/whisper/cas... [11:38:00] PROBLEM - kartotherian endpoints health on maps2001 is CRITICAL: Generic error: paths [11:38:49] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:38:59] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:39:30] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:42:18] (03PS1) 10Gehel: Send service logs by default to /srv/log. [puppet] - 10https://gerrit.wikimedia.org/r/294288 [11:42:55] (03PS1) 10Mobrovac: service::configuration: Set the default log_dir to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294289 [11:43:04] hehe gehel ^^^ [11:43:33] wow! I'm almost as fast as you are! Now let's see if the implementation looks alike... 
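mobrovac's review note above (service::node does not ensure service::configuration::log_dir exists before creating the child directory) is the classic puppet parent-directory pitfall: a file resource for /srv/log/<service> fails if /srv/log itself is absent. A hedged sketch of the kind of guard being asked for, not the code that was eventually merged:

    # make sure the parent exists before any per-service child dir is created
    file { '/srv/log':
      ensure => directory,
      owner  => 'root',
      group  => 'root',
      mode   => '0755',
    }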
[11:43:44] gehel: running pcc for my change [11:43:49] (03PS1) 10ArielGlenn: add new hostname for dumps rsync for crc.nd.edu mirror [puppet] - 10https://gerrit.wikimedia.org/r/294290 [11:44:20] (03CR) 10jenkins-bot: [V: 04-1] service::configuration: Set the default log_dir to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [11:44:36] gr, it must be a lint issue [11:44:59] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.019 second response time [11:45:06] (03CR) 10Gehel: service::configuration: Set the default log_dir to /srv/log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [11:45:23] 06Operations, 10Monitoring, 07Graphite: "carbon-cache too many creates" on graphite1001 - https://phabricator.wikimedia.org/T137380#2379253 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi it is usually actionable in the sense that too many files are being created and thus disk might be filling too quic... [11:45:36] (03CR) 10ArielGlenn: [C: 032] add new hostname for dumps rsync for crc.nd.edu mirror [puppet] - 10https://gerrit.wikimedia.org/r/294290 (owner: 10ArielGlenn) [11:45:50] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [11:46:57] (03CR) 10Mobrovac: service::configuration: Set the default log_dir to /srv/log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [11:50:57] (03PS2) 10Mobrovac: service::configuration: Set the default log_dir to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294289 [11:55:06] (03CR) 10Gehel: service::configuration: Set the default log_dir to /srv/log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [11:55:23] mobrovac: ^ [11:55:51] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [12:04:20] (03CR) 10Mobrovac: service::configuration: Set the default log_dir to /srv/log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [12:04:40] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/3110/" [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [12:04:56] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [12:05:38] PROBLEM - Disk space on mw1273 is CRITICAL: Timeout while attempting connection [12:05:38] PROBLEM - Disk space on mw1272 is CRITICAL: Timeout while attempting connection [12:06:07] PROBLEM - MD RAID on mw1273 is CRITICAL: Timeout while attempting connection [12:06:07] PROBLEM - MD RAID on mw1272 is CRITICAL: Timeout while attempting connection [12:07:03] this is me ---^ [12:07:06] PROBLEM - configured eth on mw1272 is CRITICAL: Timeout while attempting connection [12:07:06] PROBLEM - configured eth on mw1273 is CRITICAL: Timeout while attempting connection [12:07:06] PROBLEM - Apache HTTP on mw1272 is CRITICAL: Connection timed out [12:07:06] PROBLEM - Apache HTTP on mw1273 is CRITICAL: Connection timed out [12:07:07] new appservers [12:07:16] PROBLEM - dhclient process on mw1273 is CRITICAL: Timeout while attempting connection [12:07:16] PROBLEM - dhclient process on mw1272 is CRITICAL: Timeout while attempting connection [12:07:37] PROBLEM - mediawiki-installation DSH group on mw1272 is CRITICAL: Host mw1272 is not in mediawiki-installation dsh group [12:07:37] PROBLEM - mediawiki-installation DSH group on mw1273 is CRITICAL: Host mw1273 is not in mediawiki-installation dsh group [12:07:52] (03CR) 
10Alexandros Kosiaris: [C: 04-1] Send service logs by default to /srv/log. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294288 (owner: 10Gehel) [12:08:06] PROBLEM - nutcracker port on mw1273 is CRITICAL: Timeout while attempting connection [12:08:06] PROBLEM - nutcracker port on mw1272 is CRITICAL: Timeout while attempting connection [12:08:18] PROBLEM - nutcracker process on mw1272 is CRITICAL: Timeout while attempting connection [12:08:18] PROBLEM - nutcracker process on mw1273 is CRITICAL: Timeout while attempting connection [12:08:37] PROBLEM - puppet last run on mw1273 is CRITICAL: Timeout while attempting connection [12:08:37] PROBLEM - puppet last run on mw1272 is CRITICAL: Timeout while attempting connection [12:08:46] 06Operations, 07Puppet, 13Patch-For-Review: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10BBlack) Personally, I'm not a fan of arrow alignment enforcement with automatic gating reviews like we do today. It's common coding/style advice to always separate cl... [12:08:57] PROBLEM - salt-minion processes on mw1273 is CRITICAL: Timeout while attempting connection [12:08:57] PROBLEM - salt-minion processes on mw1272 is CRITICAL: Timeout while attempting connection [12:09:36] PROBLEM - Check size of conntrack table on mw1272 is CRITICAL: Timeout while attempting connection [12:09:36] PROBLEM - Check size of conntrack table on mw1273 is CRITICAL: Timeout while attempting connection [12:09:47] PROBLEM - DPKG on mw1273 is CRITICAL: Timeout while attempting connection [12:11:20] (03PS1) 10Hashar: Initial debianization [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) [12:16:54] 06Operations, 07Puppet, 13Patch-For-Review: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2379351 (10BBlack) Also: something's perhaps wrong with puppetlint when doing arrow-alignment-fix in isolation from other checks. In the strawman commitdiff for that, most of th... [12:18:12] (03CR) 10Muehlenhoff: [C: 031] "nitpick on the commit message, but looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [12:19:52] (03CR) 10Hashar: "Build a dummy package targeting unstable. Result at https://people.wikimedia.org/~hashar/debs/gecokdriver_0.8.0/" [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [12:21:01] 06Operations, 10Traffic, 10domains, 07HTTPS: xx.wikipedia.com returns a certificate error - https://phabricator.wikimedia.org/T137779#2379369 (10BBlack) [12:21:03] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2379371 (10BBlack) [12:22:09] (03CR) 10Mobrovac: service::configuration: Set the default log_dir to /srv/log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [12:23:43] (03PS1) 10Jcrespo: Repool db1024, pool for the first time db1090 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294295 (https://phabricator.wikimedia.org/T133398) [12:23:50] (03CR) 10Hashar: [C: 04-1] "Needs to have cargo to install to DESTDIR / tweak install etc. The .deb is missing all compiled files." 
[debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [12:25:12] (03CR) 10Jcrespo: [C: 032] Repool db1024, pool for the first time db1090 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294295 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [12:26:16] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:27:19] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1024, pool for the first time db1090 with low weight (duration: 00m 38s) [12:27:20] (03CR) 10Muehlenhoff: "You'll need a .install file in debian/" [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [12:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:56] RECOVERY - Apache HTTP on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.002 second response time [12:34:29] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [12:35:06] RECOVERY - configured eth on mw1272 is OK: OK - interfaces up [12:35:16] RECOVERY - dhclient process on mw1272 is OK: PROCS OK: 0 processes with command name dhclient [12:35:16] RECOVERY - nutcracker port on mw1272 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:35:17] RECOVERY - Check size of conntrack table on mw1272 is OK: OK: nf_conntrack is 0 % full [12:35:27] RECOVERY - nutcracker process on mw1272 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:36:06] RECOVERY - salt-minion processes on mw1272 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:36:17] RECOVERY - MD RAID on mw1272 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:36:26] RECOVERY - Disk space on mw1272 is OK: DISK OK [12:36:35] 06Operations, 07Puppet, 13Patch-For-Review: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10faidon) Aligned arrows is pretty common in puppet code out there. It was popular even before the official style guide was drafted. I've been using them since I started... 
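The STRAWMAN change above leans on puppet-lint's auto-fix mode rather than hand-aligning; roughly, an invocation like this (illustrative, not necessarily the exact command in the patch) rewrites the arrows across the tree:

    # run only the arrow_alignment check, and let puppet-lint edit the files in place
    puppet-lint --only-checks arrow_alignment --fix modules/ manifests/

BBlack's comment on T137763 notes that running the arrow-alignment fix in isolation from the other checks seems to misbehave, which is part of why the change is labelled a strawman.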
[12:37:29] (03PS3) 10Mobrovac: service::configuration: Set the default log_dir to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294289 [12:37:42] (03PS4) 10BBlack: tlsproxy: turn proxy_request_buffering off for v4 [puppet] - 10https://gerrit.wikimedia.org/r/287996 [12:38:06] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy: turn proxy_request_buffering off for v4 [puppet] - 10https://gerrit.wikimedia.org/r/287996 (owner: 10BBlack) [12:38:37] RECOVERY - Apache HTTP on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.010 second response time [12:39:23] (03PS4) 10BBlack: Support optional keepalives and websockets for v4 only [puppet] - 10https://gerrit.wikimedia.org/r/287941 (https://phabricator.wikimedia.org/T134870) [12:42:47] RECOVERY - Disk space on mw1273 is OK: DISK OK [12:43:17] RECOVERY - salt-minion processes on mw1273 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:43:17] RECOVERY - MD RAID on mw1273 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:43:47] RECOVERY - Check size of conntrack table on mw1273 is OK: OK: nf_conntrack is 0 % full [12:43:47] RECOVERY - dhclient process on mw1273 is OK: PROCS OK: 0 processes with command name dhclient [12:44:07] RECOVERY - configured eth on mw1273 is OK: OK - interfaces up [12:44:27] RECOVERY - nutcracker port on mw1273 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:44:46] RECOVERY - nutcracker process on mw1273 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [12:45:20] can we do something about all these mw* alerts please? [12:48:57] RECOVERY - DPKG on mw1273 is OK: All packages OK [12:50:10] (03PS5) 10BBlack: Support optional keepalives and websockets for v4 only [puppet] - 10https://gerrit.wikimedia.org/r/287941 (https://phabricator.wikimedia.org/T134870) [12:51:45] (03CR) 10BBlack: [C: 032] Support optional keepalives and websockets for v4 only [puppet] - 10https://gerrit.wikimedia.org/r/287941 (https://phabricator.wikimedia.org/T134870) (owner: 10BBlack) [12:53:15] (03PS2) 10BBlack: cache_misc: add stream.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/287956 (https://phabricator.wikimedia.org/T134871) [12:53:42] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: add stream.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/287956 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack) [12:55:48] !log change-prop deployed f34fb06c99 [12:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:56:53] (03PS2) 10BBlack: cache_misc: turn on keepalives + websocket support [puppet] - 10https://gerrit.wikimedia.org/r/287958 (https://phabricator.wikimedia.org/T134870) [12:57:26] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: turn on keepalives + websocket support [puppet] - 10https://gerrit.wikimedia.org/r/287958 (https://phabricator.wikimedia.org/T134870) (owner: 10BBlack) [12:58:06] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:58:44] !log installing apache trusty updates on canary app servers [12:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:00] (03PS1) 10BBlack: cache_maps: turn on keepalives [puppet] - 10https://gerrit.wikimedia.org/r/294301 (https://phabricator.wikimedia.org/T107749) [12:59:18] (03CR) 10BBlack: [C: 032 V: 032] cache_maps: turn on keepalives [puppet] - 10https://gerrit.wikimedia.org/r/294301 (https://phabricator.wikimedia.org/T107749) (owner: 
10BBlack) [13:03:44] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:04:55] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail [13:05:38] known ^ [13:07:35] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [13:07:45] (03PS1) 10BBlack: tlsproxy::instance: add missing hiera varnish_version4 [puppet] - 10https://gerrit.wikimedia.org/r/294302 [13:08:02] (03CR) 10BBlack: [C: 032 V: 032] tlsproxy::instance: add missing hiera varnish_version4 [puppet] - 10https://gerrit.wikimedia.org/r/294302 (owner: 10BBlack) [13:09:43] (03PS4) 10Gehel: service::configuration: Set the default log_dir to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [13:10:58] Hey folks. I'm trying to find the icinga configuration for ORES. Where should I look? [13:11:41] (03CR) 10Gehel: [C: 032] service::configuration: Set the default log_dir to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294289 (owner: 10Mobrovac) [13:12:29] halfak: the per host part is happening in service::uwsgi. The LVS part in hieradata/common/lvs/configuration.yaml [13:12:49] Thanks akosiaris [13:14:49] akosiaris, I'm looking for the icinga health test that actually requests a score from the wmflabs service. [13:15:01] It should be hitting the test model with a unix timestamp as a rev_id. [13:15:13] We had a major downtime event we didn't know about last week because it wasn't working. [13:15:14] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:20] halfak: anything ores.wmflabs.org related is in the badly named icinga/manifests/monitor/ores.pp [13:16:22] It should be hitting something like https://ores.wmflabs.org/scores/testwiki/reverted/1234567890/ [13:17:01] yeah, that would be modules/nagios_common/files/check_commands/check_ores_workers [13:17:14] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.039 second response time [13:18:17] Any idea why this wouldn't report an issue if the endpoint was returning a 503 error? [13:18:52] (03CR) 10Gehel: Send service logs by default to /srv/log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294288 (owner: 10Gehel) [13:19:16] (03Abandoned) 10Gehel: Send service logs by default to /srv/log [puppet] - 10https://gerrit.wikimedia.org/r/294288 (owner: 10Gehel) [13:20:37] akosiaris, ^ [13:20:46] * halfak is trying to work out how `check_http` works [13:21:11] (03PS3) 10Gehel: Make all map sources public for admins of tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/294192 (https://phabricator.wikimedia.org/T137053) (owner: 10Yurik) [13:21:37] looks relevant [13:21:37] https://www.monitoring-plugins.org/doc/man/check_http.html [13:22:34] <_joe_> akosiaris, halfak does ores expose a swagger spec? [13:23:15] <_joe_> if so, you can use service_checker (and the monitoring part of the swagger spec) to automatically do such checks [13:23:38] _joe_, it does! [13:23:38] (03CR) 10Gehel: [C: 032] Make all map sources public for admins of tileratorui [puppet] - 10https://gerrit.wikimedia.org/r/294192 (https://phabricator.wikimedia.org/T137053) (owner: 10Yurik) [13:24:03] But regretfully, there's some internal caching behavior, so we have to be a little bit clever with checking these endpoints.
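The "monitoring part of the swagger spec" _joe_ refers to is the x-amples convention that service_checker consumes: each monitored path carries example requests plus the response they must produce. A rough sketch of the shape (path and title invented here; the real spec.yaml files are the authoritative layout):

    paths:
      /_info:
        get:
          x-amples:
            - title: retrieve service info
              request: {}
              response:
                status: 200

halfak's caching caveat just below is about choosing request parameters in such stanzas that the service will always score fresh instead of serving from its internal cache.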
[13:24:28] E.g. /enwiki/reverted/{current_timestamp} is not guaranteed to work [13:24:37] <_joe_> halfak: how can you circumvent the cache? [13:24:44] but /testwiki/reverted/{any int} is guaranteed to work. [13:25:13] Oh... actually.. i forgot that there is another way to circumvent the cache. You can ask for ?features! [13:25:19] That's relatively new. [13:25:39] <_joe_> I mean I'd expect ores to obey http caching headers [13:26:01] <_joe_> or at least have some QS param you can use [13:26:01] _joe_, Oh, it's not caching the *response* so much as the score requested. [13:26:10] <_joe_> halfak: ha! [13:26:14] <_joe_> ok got it :) [13:26:18] _joe_: it does expose a swagger spec but not under /?spec. I'll make a task for that [13:26:25] (03Abandoned) 10JanZerebecki: Add a notification parameter of analytics to cassandra monitoring [puppet] - 10https://gerrit.wikimedia.org/r/293916 (https://phabricator.wikimedia.org/T137422) (owner: 10JanZerebecki) [13:26:28] (03PS1) 10Mobrovac: Change Prop: Disable transclusion update rules [puppet] - 10https://gerrit.wikimedia.org/r/294303 [13:26:32] Is "?spec" standard? [13:26:45] Oh. actually we do [13:26:46] https://ores.wmflabs.org/v2/?spec [13:26:51] <_joe_> halfak: service_checker assumes you use it :) [13:26:52] Woops. J/j [13:26:53] *j/k [13:27:00] https://ores.wmflabs.org/v2/spec/ [13:27:04] There [13:27:13] No problem to make this also work at ?spec [13:27:51] sounds great. lemme file that task and get that working so we can enable service_checker [13:27:52] Either way, I really need to work out why 503's didn't trigger the icinga event right now. [13:28:00] (03CR) 10Ppchelko: [C: 031] Change Prop: Disable transclusion update rules [puppet] - 10https://gerrit.wikimedia.org/r/294303 (owner: 10Mobrovac) [13:28:07] Because the whole major downtime and not wanting it to happen again thing. [13:28:27] !log rebooting install2001, T137647 [13:28:28] T137647: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647 [13:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:37] It looks like a 503 should have resulted in an icinga event, but it didn't happen. [13:30:33] akosiaris, is it possible that https://github.com/wikimedia/operations-puppet/blob/production/modules/nagios_common/files/check_commands/check_ores_workers is currently disabled? [13:31:14] PROBLEM - Host install2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:31:25] (03CR) 10Giuseppe Lavagetto: [C: 032] Change Prop: Disable transclusion update rules [puppet] - 10https://gerrit.wikimedia.org/r/294303 (owner: 10Mobrovac) [13:31:51] (03CR) 10Hashar: "The package manager cargo can be used to install a previously compiled package:" [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [13:32:35] RECOVERY - Host install2001 is UP: PING OK - Packet loss = 0%, RTA = 85.71 ms [13:33:05] moritzm: I can't find out an env variable to determine where the auto install step would put its material :( $(DEB_DESTDIR) is unset, targeting $(DEB_DESTDIR)/usr fails :( [13:33:15] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:34:19] I've confirmed that, in the state the service was in, I get a "503 SERVICE UNAVAILABLE" response for the health check path "/v2/scores/testwiki/reverted/13458/" [13:35:30] hashar: try "DH_VERBOSE = 1" in debian/rules [13:35:57] halfak: no, it is enabled.
But it returns a 301 for some reason [13:36:06] http:? [13:36:12] Ha! [13:36:17] We redirect to https! [13:36:26] But check_http says it follows redirects [13:37:22] (03PS1) 10BBlack: disable keepalives on maps+misc [puppet] - 10https://gerrit.wikimedia.org/r/294305 (https://phabricator.wikimedia.org/T107749) [13:37:25] PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:25] akosiaris, my notes: https://phabricator.wikimedia.org/T137592#2379533 [13:37:28] only when -f is specified [13:37:40] (03CR) 10BBlack: [C: 032 V: 032] disable keepalives on maps+misc [puppet] - 10https://gerrit.wikimedia.org/r/294305 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [13:37:47] Gotcha. So it seems that could be the problem? Maybe we should just specify that -e 200 [13:37:55] Since only a 200 response is going to be OK [13:38:29] Shall I submit a patch for this line?: https://github.com/wikimedia/operations-puppet/blob/production/modules/nagios_common/files/check_commands/check_ores_workers#L8 [13:38:44] moritzm: I keep forgetting about DH_VERBOSE :( [13:39:07] happy to remind :-) [13:39:08] halfak: yeah sure [13:39:11] kk [13:39:48] the check should just be over https [13:39:58] moritzm: then since I override dh_auto_build / dh_auto_install, I am not sure DH_VERBOSE will be of any use :D [13:40:08] !log installing apache trusty updates on codfw app servers [13:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:12] ideally 2 checks should exist [13:40:15] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:40:26] both the redirect one and the https one [13:41:01] akosiaris, OK to put two lines in that `check_ores_workers` script? [13:41:55] halfak: you will need something more than 2 lines. Right now it implicitly returns the last command's return code [13:42:11] if you put 2, you will lose the before last commands return code [13:42:17] command's* [13:42:36] akosiaris, command_1 && command_2 [13:42:43] in essence, you need 2 different checks [13:43:11] you don't want 1 check for ORES to also go CRITICAL for say the nginx lb not working [13:43:22] or, more precisely, not redirecting to https [13:43:47] akosiaris, not sure what you mean. [13:44:08] that file essentially constitutes 1 check [13:44:13] Understood [13:44:21] the output of that last command and the return code will be reported directly to icinga [13:44:29] Sure. [13:44:29] and it basically checks the redirect [13:44:45] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [13:45:00] you need a different check to check something else like ORES returning a 200 [13:45:01] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2379562 (10Gehel) [13:45:03] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Send logs to logstash for maps services (katotherian, tilerator, tileratorui) - https://phabricator.wikimedia.org/T137618#2379561 (10Gehel) 05Open>03Resolved [13:45:36] akosiaris, that is the intention of the current check.
[13:45:59] then the current check fails to implement the intention ;-) [13:46:08] 06Operations, 07HHVM: Issue rotating hhvm logs - https://phabricator.wikimedia.org/T137689#2379563 (10Joe) a:03Joe [13:46:12] Indeed it does. But it seems you called for it to check the redirect too [13:46:15] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [13:46:19] amend that check with a check_http -S and a https:// [13:46:33] but you will probably need another check for the redirect working [13:46:53] or is that something you don't care about ? [13:46:56] How about we do '-f' and leave it http? [13:47:32] which will mean what when that does critical ? [13:47:42] that the redirect is broken or that ores is broken ? [13:47:50] Either the redirect is broken (bad) or the endpoint is down (bad). [13:47:57] !log T136971 Cutting MediaWiki branches 1.28.0-wmf.6 [13:47:57] T136971: MW-1.28.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T136971 [13:48:00] yeah but which of the 2 ? [13:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:06] which of the 2 should you be debugging ? [13:48:20] akosiaris, honestly, I don't mind checking that out. It would be very easy to find out which one with a web browser and 3 seconds. [13:48:49] \ [13:48:56] may some root please delete a directory for me on tin? It is leftover from last week's deployment and I can't delete it :( ssh sudo tin.eqiad.wmnet rmdir /tmp/make-wmf-branch [13:49:01] if you are ok with that, fine by me [13:49:17] /tmp/make-wmf-branch on tin is rwxrwxr-x thcipriani wikidev [13:49:20] akosiaris, one more thing. What is "oresweb"? [13:49:29] And where does it come from? [13:49:49] hashar, done [13:49:54] hashar: blerg, sorry, forgot to remove after the last branch cut :( [13:50:01] jynus: looks like /tmp restricts deletion :) [13:50:04] thcipriani|afk: no problems! [13:50:17] thank you jynus ! [13:50:18] (03PS1) 10BBlack: cache_misc: pass all stream.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/294307 (https://phabricator.wikimedia.org/T134871) [13:50:24] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: pass all stream.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/294307 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack) [13:51:11] halfak: it's a placeholder. -H sets the Host: HTTP header and -I sets the hostname/IP to connect to. That part there could be localhost FWIW [13:51:35] Gotcha. [13:51:36] Cool [13:51:57] Looks like I'm adding "-f -e 200" to the command. [13:52:03] Patchset incoming [13:52:06] -f follow [13:52:15] and you don't need the -e 200 then [13:52:35] cause if it follows the redirect it is going to either get a 200 or fail [13:52:54] PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:53:10] (03PS1) 10Halfak: Adds follow and expect 200 OK to check_ores_workers [puppet] - 10https://gerrit.wikimedia.org/r/294309 [13:53:34] akosiaris, well, don't I want to make sure it's 200 after the redirect? [13:53:45] it does that implicitly [13:53:53] it == check_http [13:54:17] Well, in this case, it was not failing when it landed on a 301 [13:54:25] What else is it going to not fail on? [13:54:31] 301 is a valid result as well [13:54:37] it is not a mistake [13:54:40] Sure. What other valid results are there?
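A sketch of the two separate checks akosiaris is arguing for (command lines are illustrative; -S, -f, -e, -H, -I and -u are standard check_http options): one talks to the service directly over https and demands a 200, the other only verifies that plain http still redirects:

    # 1) the service itself, over https, using a unix timestamp as a throwaway rev_id
    /usr/lib/nagios/plugins/check_http -S -H ores.wmflabs.org -I ores.wmflabs.org -u "/scores/testwiki/reverted/$(date +%s)/" -e 200
    # 2) the nginx lb, over plain http, must still answer with the redirect
    /usr/lib/nagios/plugins/check_http -H ores.wmflabs.org -I ores.wmflabs.org -u "/scores/testwiki/reverted/" -e 301

Split this way, a CRITICAL immediately tells you which layer to debug, which is exactly the distinction being drawn in the next few messages.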
[13:54:49] IMO, 301 is a mistake :) [13:54:58] It was unintended behavior for sure. [13:55:01] 4XX will raise a warning [13:55:05] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 27 minutes ago with 0 failures [13:55:06] 5XX a critical [13:55:44] OK. I want to get paged if it returns anything other than 200 [13:55:45] all 3XX will result in OK as well as 2XX [13:55:50] er, paged ? [13:56:07] page == get an sms on your phone for ops [13:56:12] is that what you mean ? [13:56:25] Yes. [13:57:04] so, -e 200 will raise a critical [13:57:22] it will do so because it expects a 200 but gets back a 301 [13:57:30] But even with -f? [13:57:32] the -f follow does not work in that case [13:57:34] :-( [13:57:36] *-f follow [13:57:45] Well, that's less useful [13:57:59] Are you sure this is the behavior? It seems bad. [13:58:11] /usr/lib/nagios/plugins/check_http -f follow -e 200 -H ores.wmflabs.org -I ores.wmflabs.org -u "/scores/testwiki/reverted/" [13:58:12] HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 301 Moved Permanently [13:58:52] bad path, but point taken. [13:58:54] (03PS1) 10Giuseppe Lavagetto: hhvm: add su directive to logrotate recipe [puppet] - 10https://gerrit.wikimedia.org/r/294311 (https://phabricator.wikimedia.org/T137689) [13:59:10] IMO, the -e *should* represent the ultimate status code. [13:59:15] But here we are [13:59:18] actually not so bad path [13:59:20] /usr/lib/nagios/plugins/check_http -f follow -H ores.wmflabs.org -I ores.wmflabs.org -u "/scores/testwiki/reverted/" HTTP OK: HTTP/1.1 200 OK - 339 bytes in 0.027 second response time |time=0.027190s;;;0.000000 size=339B;;;0 [13:59:30] it is a valid endpoint after all [14:00:01] Yes. It is a valid endpoint, but not the intended endpoint [14:00:24] Heading to meeting. Will run some more tests and get a patchset later. [14:00:26] Thanks for your help akosiaris [14:00:32] don't mention it [14:04:26] PROBLEM - puppet last run on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:05:34] RECOVERY - puppet last run on mw1272 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:05:34] PROBLEM - MD RAID on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:05:54] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:06:35] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 38 minutes ago with 0 failures [14:06:57] mw127[23] are new happy appservers [14:07:03] (03PS1) 10Mobrovac: Kartotherian: Use the root URL for the checker script [puppet] - 10https://gerrit.wikimedia.org/r/294313 [14:07:14] gehel: yurik: ^^^ [14:07:35] RECOVERY - MD RAID on install2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:08:48] mobrovac, thx! what does $a = $b ? {} in ruby mean? [14:09:05] is that "use $b unless it's falsey, in which case use {}" ? [14:09:07] yurik: in puppet code, it's a switch/case [14:09:16] ah, thx [14:09:29] * yurik is not happy with puppets [14:09:44] (03CR) 10Gehel: [C: 032] Kartotherian: Use the root URL for the checker script [puppet] - 10https://gerrit.wikimedia.org/r/294313 (owner: 10Mobrovac) [14:09:49] (03PS2) 10Giuseppe Lavagetto: mediawiki::jobrunner: systemd compatibility [puppet] - 10https://gerrit.wikimedia.org/r/294255 [14:10:05] mobrovac, shouldn't it be '/_info' iff it's true?
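The construct yurik asks about above is puppet's selector, a case-like expression evaluated at assignment. A minimal sketch of how the checker URL in question could be derived with one (variable names invented; this is not the actual manifest):

    $check_url = $has_spec ? {
      true    => '/?spec',  # service publishes a swagger spec
      default => '/_info',  # plain info endpoint otherwise
    }

So '/_info' is the default arm, used precisely when has_spec is not true; mobrovac confirms as much in the next message.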
no yurik, if there's a spec, then it should be retrieved via host:port/?spec, otherwise, just do a stupid check against host:port/_info (i.e. no spec is expected there) [14:10:47] yurik: nope [14:10:58] (03CR) 10jenkins-bot: [V: 04-1] mediawiki::jobrunner: systemd compatibility [puppet] - 10https://gerrit.wikimedia.org/r/294255 (owner: 10Giuseppe Lavagetto) [14:11:03] gotcha, thx. [14:11:10] what is confusing is that ?spec is added behind the scenes... [14:11:30] * yurik loves magic, especially in code [14:11:50] * gehel has a love / hate relationship with magic... [14:12:01] yurik: run "/usr/local/lib/nagios/plugins/service_checker -t 5 10.192.0.144 http://10.192.0.144:6533" on maps2001 [14:12:10] * yurik loves to create magic, but hates when someone else creates it [14:12:54] I have a feeling we have an issue with parameter replacement in test [14:12:56] (03PS1) 10KartikMistry: apertium-en-es: Rebuilt for Jessie [debs/contenttranslation/apertium-en-es] - 10https://gerrit.wikimedia.org/r/294314 (https://phabricator.wikimedia.org/T107306) [14:13:12] gehel, i have the same feeling [14:13:23] mobrovac, shouldn't the params auto-replace? [14:13:34] and the newlines are missing :( [14:14:12] I'm not sure icinga supports new lines in error messages... [14:14:25] which params? [14:14:31] * mobrovac is lacking context [14:14:59] mobrovac, https://gist.github.com/nyurik/057ec49301ce8b5d0e57a5e6eb9c5431 [14:15:12] (03PS1) 10Elukey: Add mw127[23] to the Mediawiki DSH scap list. [puppet] - 10https://gerrit.wikimedia.org/r/294315 [14:15:21] mobrovac: coming from the following spec: https://github.com/wikimedia/maps-kartotherian/blob/master/spec.yaml [14:15:33] (03CR) 10Elukey: [C: 032] Add mw127[23] to the Mediawiki DSH scap list. [puppet] - 10https://gerrit.wikimedia.org/r/294315 (owner: 10Elukey) [14:15:35] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail [14:18:45] (03PS2) 10KartikMistry: apertium-en-es: Rebuilt for Jessie [debs/contenttranslation/apertium-en-es] - 10https://gerrit.wikimedia.org/r/294314 (https://phabricator.wikimedia.org/T107306) [14:22:10] gehel: yurik: have you tried manually issuing the requests against the service on the node? [14:22:32] gehel: yurik: e.g. /{src}/{z},{lat},{lon},{w}x{h}@{scale}x.{format} (by replacing the params with the same ones as from the spec) [14:23:14] actually no, let me try [14:23:31] mobrovac: doing it right now... [14:23:37] gehel: yurik: for /_info/home the spec is wrong, it says the checker script should receive a 200, but the service sends a 301 back [14:23:39] gehel, i will try the last one [14:23:54] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: Puppet has 1 failures [14:23:55] checking... [14:25:10] actually /{src}/{z},{lat},{lon},{w}x{h}.{format} is wrong - there are no test params [14:25:14] i'm certain you can resolve most of the issues just by looking at the output of the checker script, and comparing it with what you get manually and the spec [14:26:25] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: puppet fail [14:26:47] mobrovac: yep, at least some of the tests are passing, the issue is on our side [14:32:05] (03PS2) 10Hashar: Initial debianization [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) [14:33:04] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [14:35:01] (03CR) 10Hashar: "The binary is now included in the .deb package.
I have made it work by adding in debian/rules:" [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [14:36:19] (03PS2) 10Giuseppe Lavagetto: hhvm: add su directive to logrotate recipe [puppet] - 10https://gerrit.wikimedia.org/r/294311 (https://phabricator.wikimedia.org/T137689) [14:37:59] (03PS1) 10BBlack: websockets VCL: after HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/294320 (https://phabricator.wikimedia.org/T134870) [14:38:22] (03CR) 10BBlack: [C: 032 V: 032] websockets VCL: after HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/294320 (https://phabricator.wikimedia.org/T134870) (owner: 10BBlack) [14:38:54] (03CR) 10Ema: [C: 031] hhvm: add su directive to logrotate recipe [puppet] - 10https://gerrit.wikimedia.org/r/294311 (https://phabricator.wikimedia.org/T137689) (owner: 10Giuseppe Lavagetto) [14:40:56] (03CR) 10Muehlenhoff: "Neither rustc nor cargo are in jessie, though." [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [14:43:41] (03PS3) 10Giuseppe Lavagetto: hhvm: add su directive to logrotate recipe [puppet] - 10https://gerrit.wikimedia.org/r/294311 (https://phabricator.wikimedia.org/T137689) [14:45:03] wait what [14:45:16] <_joe_> ? [14:45:26] oh [14:45:29] nevermind :) [14:48:06] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: add su directive to logrotate recipe [puppet] - 10https://gerrit.wikimedia.org/r/294311 (https://phabricator.wikimedia.org/T137689) (owner: 10Giuseppe Lavagetto) [14:48:14] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [14:50:17] (03PS1) 10KartikMistry: apertium-en-gl: Rebuilt for Jessie and other fixes [debs/contenttranslation/apertium-en-gl] - 10https://gerrit.wikimedia.org/r/294322 (https://phabricator.wikimedia.org/T107306) [14:51:45] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [14:52:55] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:53:17] (03PS1) 10BBlack: stream: use hash(X-Client-IP) for backend selection [puppet] - 10https://gerrit.wikimedia.org/r/294323 (https://phabricator.wikimedia.org/T134871) [14:55:15] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet has 1 failures [14:55:55] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [14:56:25] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [14:56:34] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures [14:56:35] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures [14:56:54] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:56:55] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: Puppet has 1 failures [14:56:55] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 1 failures [14:57:04] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 1 failures [14:57:05] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Puppet has 1 failures [14:57:37] (03CR) 10BBlack: [C: 032] stream: use hash(X-Client-IP) for backend selection [puppet] - 10https://gerrit.wikimedia.org/r/294323 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack) [14:57:45] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA 
request for @WMDE-leszek - https://phabricator.wikimedia.org/T133145#2379731 (10WMDE-leszek) Thank you @jcrespo, I can access grafana-admin.wikimedia.org now! [14:57:54] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Puppet has 1 failures [14:57:55] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 1 failures [15:00:04] anomie, ostriches, thcipriani, marktraceur, and Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160614T1500). Please do the needful. [15:00:04] James_F, RoanKattouw, and tgr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:53] * James_F waves. [15:01:01] I'm here for RoanKattouw's too if needed. [15:01:12] I can SWAT today. [15:01:13] * kart_ ere [15:01:16] here* [15:01:31] Wait. Why am I not in the list? [15:01:41] (03PS5) 10Thcipriani: Enable VisualEditor by default on eleven Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T136990) (owner: 10Jforrester) [15:02:05] kart_: perhaps you forgot to save? happened to me once [15:02:12] Right :D [15:02:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T136990) (owner: 10Jforrester) [15:02:56] (03Merged) 10jenkins-bot: Enable VisualEditor by default on eleven Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T136990) (owner: 10Jforrester) [15:03:57] thcipriani: add me too please :) [15:04:19] kart_: yup :) [15:04:34] ACKNOWLEDGEMENT - kartotherian endpoints health on maps2001 is CRITICAL: /{src}/{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 404 (expecting: 200): /v4/marker/{base}-{size}-{symbol}+{color}@{scale}x.png (scaled pushpin marker with an icon) is CRITICAL: Test scaled pushpin marker with an icon returned the unexpected status 400 (expecting: 200): /v4/marker/ [15:06:13] thcipriani, are you deploying? [15:06:23] yurik: yes [15:06:29] i need to deploy the new kartotherian service via scap3 to fix ^^ [15:06:36] will that affect you, should i wait? [15:07:10] !log thcipriani@tin Synchronized dblists/visualeditor-default.dblist: SWAT: [[gerrit:292746|Enable VisualEditor by default on eleven Wikivoyages]] (duration: 01m 49s) [15:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:17] thcipriani, or can we go at the same time? [15:07:25] yurik: should be fine now just poke me when you're done [15:07:30] ok [15:07:47] ^ James_F check sync please (unsure if I need to touch InitialiseSettings in this instance) [15:08:33] Checking. [15:08:45] tgr: ping for SWAT [15:09:02] thcipriani: here [15:09:14] thcipriani, does scap3 use the spec.yaml for testing, same as icinga? [15:09:17] tgr: okie doke, wanted to make sure before merge. [15:09:32] !log deployed & restarted kartotherian [15:09:35] thcipriani, done [15:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:40] gehel, ^ [15:09:56] thcipriani: Yup, everything's tickety-boo. [15:10:25] yurik: thanks. Depends on configuration. IIRC the nrpe checks for services use the spec.yaml. mobrovac should fact-check that statement, however :) [15:10:36] James_F: is that good? do we want that?
[15:10:54] thcipriani: Sorry, yes, everything is great. [15:10:57] :D [15:11:04] RECOVERY - mediawiki-installation DSH group on mw1272 is OK: OK [15:11:04] RECOVERY - mediawiki-installation DSH group on mw1273 is OK: OK [15:11:22] it sounded good, but I wasn't completely sure :) [15:13:36] 06Operations, 10Traffic, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2379830 (10BBlack) Seems to basically work. Ideally we'd limit the websocket VCL capabilities based on req.http.Host as well, but that will have to come later with further refactoring of... [15:13:43] 06Operations, 10Traffic, 13Patch-For-Review: Support websockets in cache_misc - https://phabricator.wikimedia.org/T134870#2379832 (10BBlack) 05Open>03Resolved a:03BBlack [15:13:45] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2379835 (10BBlack) [15:13:47] James_F: you're fine testing RoanKattouw 's changes? [15:14:00] Sorry, I'm here now [15:14:11] Yes, but I'll slack off and let him do it. :-D [15:14:14] RoanKattouw: okie doke, np :) [15:14:21] (03PS2) 10Thcipriani: Add nonecho.dblist and echo.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294269 (https://phabricator.wikimedia.org/T137771) (owner: 10Catrope) [15:14:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294269 (https://phabricator.wikimedia.org/T137771) (owner: 10Catrope) [15:15:02] thcipriani: You'll want to sync the dblists first, then CommonSettings, then InitialiseSettings [15:15:35] Not to be confused with tickety-boom [15:15:37] RoanKattouw: yup, I was just sorting that out, thanks :) [15:15:41] (03Merged) 10jenkins-bot: Add nonecho.dblist and echo.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294269 (https://phabricator.wikimedia.org/T137771) (owner: 10Catrope) [15:16:03] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:16:44] (03CR) 10WMDE-Fisch: [C: 031] "*poke*" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [15:17:43] (03PS1) 10Jcrespo: Repool db1040, pool for the first time db1081, db1084, db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294329 (https://phabricator.wikimedia.org/T133398) [15:18:08] !log thcipriani@tin Synchronized dblists: SWAT: [[gerrit:294269|Add nonecho.dblist and echo.dblist]] PART I (duration: 00m 30s) [15:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:51] (03PS3) 10Giuseppe Lavagetto: mediawiki::jobrunner: systemd compatibility [puppet] - 10https://gerrit.wikimedia.org/r/294255 [15:18:53] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:294269|Add nonecho.dblist and echo.dblist]] PART II (duration: 00m 30s) [15:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:32] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294269|Add nonecho.dblist and echo.dblist]] PART III (duration: 00m 28s) [15:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:57] ^ RoanKattouw sync'd, check please [15:20:04] Looking [15:20:14] 06Operations, 10Traffic, 13Patch-For-Review: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2379847 
(10BBlack) This seems to be working now. It's fully-configured on cache_misc other than switching the DNS resolution for stream.wm.o to cache_misc, and can... [15:20:25] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:20:27] (03PS6) 10JanZerebecki: Load the RevisionSlider extension on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287936 (https://phabricator.wikimedia.org/T134770) (owner: 10Addshore) [15:20:54] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:21:04] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:21:35] thcipriani: Looks good [15:21:40] 06Operations, 10Traffic, 10Wikimedia-Stream: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2379849 (10BBlack) [15:21:44] RoanKattouw: cool, thanks for checking [15:22:03] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:04] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:13] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:24] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:22:33] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:53] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:53] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:14] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:23:47] thcipriani: Ahm, I don't see the dblists in /srv/mediawiki on terbium or tin, or even in /srv/mediawiki-staging on tin... [15:24:07] I think you forgot to git pull [15:24:18] crap [15:24:20] :-) [15:24:25] I fetched, but didn't rebase [15:24:58] kk, I'll try that again. [15:25:52] !log thcipriani@tin Synchronized dblists: SWAT: [[gerrit:294269|Add nonecho.dblist and echo.dblist]] PART I (duration: 00m 28s) [15:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:01] !log reimage ms-fe3001 with jessie T117972 [15:26:02] T117972: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972 [15:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:44] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:294269|Add nonecho.dblist and echo.dblist]] PART II (duration: 00m 26s) [15:27:28] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:294269|Add nonecho.dblist and echo.dblist]] PART III (duration: 00m 27s) [15:27:33] RoanKattouw: ^ sync'd. 
thank you for the thorough check on a no-op [15:27:45] I will do another thorough check now [15:28:25] (03PS1) 10Jgreen: Modify secret.rb to accept a file list and use first match, like http://www.puppetcookbook.com/posts/select-a-file-based-on-a-fact.html [puppet] - 10https://gerrit.wikimedia.org/r/294331 [15:29:30] 06Operations, 10Traffic, 10Wikimedia-Stream: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2379875 (10BBlack) Looking into the Websockets RFC ( https://tools.ietf.org/html/rfc6455 ), it says in section 4.1: ``` Once the client's opening handshake has been... [15:30:18] thcipriani, can i push one more minor kartotherian scap3 out? [15:31:07] (03CR) 10Jgreen: "I'm not sure if this is functionality we want for production, but we use this file search feature in frack." [puppet] - 10https://gerrit.wikimedia.org/r/294331 (owner: 10Jgreen) [15:31:08] thcipriani: Looking good [15:31:13] And not like a no-op this time [15:31:23] (03PS1) 10Gehel: Enable kartotherian service check for all maps servers [puppet] - 10https://gerrit.wikimedia.org/r/294333 (https://phabricator.wikimedia.org/T137617) [15:31:25] RoanKattouw: :) thanks [15:31:38] yurik: can I ping you post-SWAT? Only a few more patches. [15:31:51] thcipriani, no worries, let me know. CC: gehel [15:34:05] (03PS1) 10Filippo Giunchedi: install_server: ms-fe300[12] on jessie [puppet] - 10https://gerrit.wikimedia.org/r/294334 (https://phabricator.wikimedia.org/T117972) [15:34:44] (03PS2) 10Filippo Giunchedi: install_server: ms-fe300[12] on jessie [puppet] - 10https://gerrit.wikimedia.org/r/294334 (https://phabricator.wikimedia.org/T117972) [15:34:53] PROBLEM - MD RAID on install2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install_server: ms-fe300[12] on jessie [puppet] - 10https://gerrit.wikimedia.org/r/294334 (https://phabricator.wikimedia.org/T117972) (owner: 10Filippo Giunchedi) [15:35:04] !log thcipriani@tin Synchronized php-1.28.0-wmf.5/extensions/CentralAuth/includes/CentralAuthHooks.php: SWAT: [[gerrit:294318|Account for changed login process]] (duration: 00m 26s) [15:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:08] ^ tgr check please [15:35:31] (I also pulled in the change to 1.28.0-wmf.6, it'll go out with everything else in that branch) [15:36:41] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2379894 (10Antigng_) If you don't give me a good reason why cp1008.wikimedia.org:3128 / index.php?action=raw shouldn't be used, I will start some of my jobs that don... [15:36:53] RECOVERY - MD RAID on install2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:37:15] thcipriani: verified, thanks! [15:37:26] tgr: thanks for checking! [15:38:09] (03PS2) 10Halfak: Adds follow and expect 200 OK to check_ores_workers [puppet] - 10https://gerrit.wikimedia.org/r/294309 [15:38:18] (03PS1) 10Hashar: Group0 to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294336 [15:38:50] (03PS2) 10Hashar: Group0 to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294336 (https://phabricator.wikimedia.org/T136971) [15:39:07] Nikerabbit: are you fine with https://gerrit.wikimedia.org/r/#/c/291908/2 going out as is? 
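The staging hiccup earlier (the dblists missing from /srv/mediawiki-staging after the merge) comes down to the fetch-versus-rebase distinction; generically, in any git checkout:

    git fetch                  # updates remote-tracking refs only; the working tree is untouched
    git rebase origin/master   # actually moves the checkout onto the fetched commits
    # or, as one step:
    git pull --rebase

Only after the second step does the merged change exist on disk for scap to sync out.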
[15:40:18] !log hashar@tin Started scap: testwiki to php-1.28.0-wmf.6 and rebuild l10n cache [15:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:21] hashar: I was just finishing SWAT :( [15:41:25] oh man [15:41:32] thcipriani: want me to cancel that? [15:41:45] hashar: yeah if you would [15:41:48] I do the dance before swat usually so no conflict [15:41:50] !log hashar@tin scap aborted: testwiki to php-1.28.0-wmf.6 and rebuild l10n cache (duration: 01m 31s) [15:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:56] thcipriani: done! sorry :( [15:42:30] hashar: no problem, probably didn't make it too far in l10n cache rebuilding, thanks for the cancel :) [15:43:08] I need a window pinned on my desktop with the deployment calendar [15:43:25] kart_: mind manually rebasing your patch, gerrit is not happy when I try [15:44:45] (03PS3) 10Hashar: Initial debianization [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) [15:44:54] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2379907 (10Joe) @Antigng_ just to understand, what is your bot doing? If dumps are not refreshed fast enough for you, maybe you should make your bot follow one of th... [15:45:20] (03PS3) 10Halfak: Adds HTTP follow to check_ores_workers [puppet] - 10https://gerrit.wikimedia.org/r/294309 [15:45:29] thcipriani: let me do that. [15:47:31] thcipriani: maybe too late but yeah as-is is okay as well [15:47:46] Nikerabbit: not too late, thanks for the reply :) [15:49:16] (03PS3) 10KartikMistry: Beta: Enable Compact Language Links for new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291908 (https://phabricator.wikimedia.org/T136161) [15:49:38] (03CR) 10Hashar: "I have added a basic man page using help2man" [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [15:50:04] thcipriani: done. Please check :) [15:50:17] (03PS4) 10Thcipriani: Beta: Enable Compact Language Links for new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291908 (https://phabricator.wikimedia.org/T136161) (owner: 10KartikMistry) [15:50:23] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2379915 (10Antigng_) Most of my tasks don't generate such " unacceptable amount of traffic". 
[15:50:30] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291908 (https://phabricator.wikimedia.org/T136161) (owner: 10KartikMistry) [15:51:02] (03Merged) 10jenkins-bot: Beta: Enable Compact Language Links for new users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291908 (https://phabricator.wikimedia.org/T136161) (owner: 10KartikMistry) [15:53:06] kart_: should go out with the next beta-scap-eqiad [15:53:19] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:291908|Beta: Enable Compact Language Links for new users]] (duration: 00m 31s) [15:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:37] yurik: clear for kartotherian update [15:53:47] thcipriani, thx, in progress [15:54:38] !log Restarting cassandra-metrics-collector on restbase1007 : T137304 [15:54:39] T137304: 1000+ keyspace metrics you didn't see coming - https://phabricator.wikimedia.org/T137304 [15:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:56:12] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376097 (10BBlack) >>! In T137707#2376258, @Antigng_ wrote: > My bot was using /w/index.php?action=raw to fetch the content of each page/redirect at zhwiki, then it... [15:56:20] 06Operations, 10cassandra: 1000+ keyspace metrics you didn't see coming - https://phabricator.wikimedia.org/T137304#2379929 (10Eevans) >>! In T137304#2379243, @fgiunchedi wrote: > no problem, I see some metrics are still being updated though, e.g. > ``` > 3436550197 304 -rw-r--r-- 1 _graphite _graphite 30... [15:56:21] thcipriani: how much time it will take? [15:57:05] RECOVERY - kartotherian endpoints health on maps2001 is OK: All endpoints are healthy [15:57:11] kart_: runs every 10 mins IIRC. 16:04 will be the next one. [15:57:12] !log deployed & restarted kartotherian (fixing spec.config tests) [15:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:19] thcipriani, gehel, done [15:57:26] yurik: :) [15:57:33] and it passes!!! [15:57:38] gehel, ^ :) [15:57:48] hashar: should be all clear to do the l10n update/testwiki scap [15:58:07] 06Operations, 10Traffic, 10Wikimedia-Stream: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2379935 (10BBlack) Another datapoint, in nginx logs on rcs1001, most successful operations seem to be non-SSL: ``` root@rcs1001:/var/log/nginx# grep -v rcstream_stat... [15:58:48] <_joe_> ihm puppetswat time [15:58:54] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:05] <_joe_> thcipriani: is swat done? [15:59:11] _joe_: yup [15:59:26] thcipriani: thanks, OK! [15:59:52] <_joe_> RoanKattouw: around? [16:00:03] _joe_: Yup [16:00:04] godog, moritzm, and coreyfloyd: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160614T1600). [16:00:04] RoanKattouw: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
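The RFC 6455 detail being quoted above can be poked at by hand: the opening handshake is ordinary HTTP with Upgrade headers, so curl can show whether the cache layer passes it through. A sketch — the socket.io path and query string are illustrative guesses, not the service's documented endpoint:

```
KEY=$(head -c 16 /dev/urandom | base64)
curl -si --max-time 5 \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: ${KEY}" \
  "https://stream.wikimedia.org/socket.io/?EIO=3&transport=websocket"
# a server that accepts the handshake answers "HTTP/1.1 101 Switching
# Protocols"; curl then just times out instead of speaking the framed
# protocol, which is fine for this check
```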
[16:00:17] <_joe_> I am doing SWAT today [16:00:31] ok [16:01:11] (03PS2) 10Giuseppe Lavagetto: Only run processEchoEmailBatch.php on Echo-enabled wikis [puppet] - 10https://gerrit.wikimedia.org/r/294271 (https://phabricator.wikimedia.org/T137771) (owner: 10Catrope) [16:01:21] <_joe_> RoanKattouw: it's pretty simple, I'll merge right away [16:01:32] <_joe_> I verified the dblist is in mediawiki-config [16:01:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Only run processEchoEmailBatch.php on Echo-enabled wikis [puppet] - 10https://gerrit.wikimedia.org/r/294271 (https://phabricator.wikimedia.org/T137771) (owner: 10Catrope) [16:03:34] 06Operations, 10Traffic, 10Wikimedia-Stream: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2379955 (10BBlack) Obviously, if we can't fix the existing non-SSL clients in a timely fashion (or can't assume they can handle redirects), our other option is to pu... [16:04:27] <_joe_> RoanKattouw: done [16:04:34] Thanks! [16:05:21] (03PS1) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [16:05:56] (03Abandoned) 10Andrew Bogott: Define PUPPETMASTER_API for Horizon [puppet] - 10https://gerrit.wikimedia.org/r/294188 (owner: 10Andrew Bogott) [16:06:11] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2379979 (10Catrope) [16:06:19] 06Operations, 10ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2374163 (10Papaul) physical observation: everything looks good on the server. all LEDs are green, no sign of server overheating, no error reported in the log. Next step is to run a full hardware diagnostic. Is i... [16:06:31] (03CR) 10jenkins-bot: [V: 04-1] WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [16:07:56] 06Operations, 10ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2379982 (10faidon) Yes, that would be fine, please do! (note that I rebooted the server earlier as well, to rule out the possibility it was a software issue) [16:08:43] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2379986 (10Papaul) I chatted with Filippo on IRC so the final layout is to put all 6 servers in row D (D1, D3, D4, D5, D7 and D8) [16:09:57] 06Operations, 10Traffic, 10Wikimedia-Stream: Move stream.wikimedia.org (rcstream) behind cache_misc - https://phabricator.wikimedia.org/T134871#2379993 (10BBlack) Tried the sample JS client code too, from https://wikitech.wikimedia.org/wiki/RCStream#JavaScript . Same basic results. It works fine if I prepe... [16:10:02] (03PS2) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [16:10:41] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2379995 (10Antigng_) Also, there doesn't exist a clear request rate limit for mediawiki api, as the rest api does. If you want to set one, you should document it. [16:10:42] thcipriani: config works fine in beta as expected. Thanks!
[16:11:07] kart_: glad to hear it, thanks for following up :) [16:11:11] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2379997 (10jcrespo) For the API part, I would like to add that API infrastructure (application servers and databases) is specifically prepared to be separated from n... [16:15:04] (03PS1) 10BBlack: frontend VCL: stream.wm.o TLS exception [puppet] - 10https://gerrit.wikimedia.org/r/294346 (https://phabricator.wikimedia.org/T134871) [16:15:43] (03CR) 10BBlack: [C: 032 V: 032] frontend VCL: stream.wm.o TLS exception [puppet] - 10https://gerrit.wikimedia.org/r/294346 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack) [16:22:23] PROBLEM - test icmp reachability to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 37 probes of 390 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:22:53] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2380046 (10Antigng_) I don't think api.php?action=query&prop=revisions&rvprop=content can be the same performant as index.php?action=raw, and the latter is the easie... [16:23:04] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:23:23] PROBLEM - Host install2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:25:13] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [16:25:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [16:25:13] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [16:25:14] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: puppet fail [16:26:38] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2380110 (10ema) [16:26:39] 06Operations, 10Traffic, 13Patch-For-Review: cronspam from cpXXXX hosts due to update-ocsp-all and zero_fetch - https://phabricator.wikimedia.org/T132835#2380107 (10ema) 05Open>03Resolved a:03ema Both update-ocsp-all and zerofetch are now logging to syslog instead of cronspamming. [16:26:53] ^^ frack puppet, I'm aware. working on puppetmaster [16:30:13] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 32 failures [16:30:13] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:30:13] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 34 failures [16:30:13] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 145 seconds ago with 0 failures [16:30:14] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [16:30:14] RECOVERY - check_puppetrun on saiph is OK: OK: Puppet is currently enabled, last run 225 seconds ago with 0 failures [16:30:14] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [16:32:36] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2380159 (10jcrespo) > I would appreciate it if there was a way to perform api.php?action=raw Please file a separate bug report for that. 
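The "TLS exception" patch merged above is presumably a one-host carve-out in the frontend's HTTPS-enforcement path. A rough Varnish 3 sketch of the idea only — the 751 error convention and exact condition are illustrative, not the actual commit:

```
sub vcl_recv {
    // force insecure requests onto the HTTPS redirect path...
    if (req.http.X-Forwarded-Proto != "https"
        // ...but exempt rcstream so legacy non-SSL websocket
        // clients keep working while they migrate
        && req.http.Host != "stream.wikimedia.org") {
        error 751 "TLS Redirect";
    }
}
```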
[16:33:42] (03PS2) 10BBlack: VCL: move fe 503-retry to top of v3 vcl_error [puppet] - 10https://gerrit.wikimedia.org/r/293720 [16:33:48] (03CR) 10BBlack: [C: 032 V: 032] VCL: move fe 503-retry to top of v3 vcl_error [puppet] - 10https://gerrit.wikimedia.org/r/293720 (owner: 10BBlack) [16:35:09] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 32 failures [16:35:09] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 34 failures [16:35:09] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [16:35:18] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [16:35:46] (03PS2) 10Jcrespo: Repool db1040, pool for the first time db1081, db1084, db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294329 (https://phabricator.wikimedia.org/T133398) [16:37:41] (03CR) 10Jcrespo: [C: 032] Repool db1040, pool for the first time db1081, db1084, db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294329 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [16:38:45] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2380191 (10RobH) Updated from IRC Chat: These can all do 10GbE and won't be in the same service cluster, so they can share racks. Please place half of these in D2, and the other half in D7. [16:40:07] 06Operations, 10ops-codfw, 10media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2380201 (10RobH) [16:40:08] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 32 failures [16:40:08] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [16:40:09] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 34 failures [16:40:18] RECOVERY - check_puppetrun on payments2001 is OK: OK: Puppet is currently enabled, last run 77 seconds ago with 0 failures [16:41:41] !log reimage ms-fe3002 with jessie T117972 [16:41:42] T117972: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972 [16:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:21] did something just change with planet.wikimedia? [16:44:28] greg-g: like what? [16:45:08] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 32 failures [16:45:08] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [16:45:09] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 34 failures [16:45:53] 06Operations, 10ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2380226 (10Papaul) Hardware diagnostic shows no HW problem. I first checked the BIOS settings: "System Profile" was set to "Performance per Watt DAPC" but is supposed to be "Performance per Watt OS". I change it... [16:46:00] bblack: my news reader (newsblur, hosted) just got a ton of posts, oldest from may 15th [16:46:35] greg-g: no idea about that [16:47:50] 06Operations, 10cassandra, 10hardware-requests: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2380227 (10RobH) In reviewing this request, it isn't clear to me how the administration of these machines would exist. Would these be normal production machines, on producti...
[16:48:12] bblack: yeah, could be on newsblur's side, their update/resync cycle is not clear [16:48:19] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1040, pool for the first time db1081, db1084, db1091 (duration: 00m 34s) [16:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:09] RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 141 seconds ago with 0 failures [16:50:09] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 34 failures [16:50:09] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail [16:50:09] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [16:55:08] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [16:55:09] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail [16:55:09] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 91 seconds ago with 0 failures [16:55:09] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: puppet fail [16:59:29] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160614T1700). [17:00:08] PROBLEM - check_puppetrun on pay-lvs2001 is CRITICAL: CRITICAL: puppet fail [17:00:08] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [17:00:09] RECOVERY - check_puppetrun on saiph is OK: OK: Puppet is currently enabled, last run 191 seconds ago with 0 failures [17:00:13] none today. [17:00:18] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [17:00:25] (03PS4) 10Yuvipanda: icinga: Adds HTTP follow to check_ores_workers [puppet] - 10https://gerrit.wikimedia.org/r/294309 (owner: 10Halfak) [17:00:34] (03PS5) 10Yuvipanda: icinga: Adds HTTP follow to check_ores_workers [puppet] - 10https://gerrit.wikimedia.org/r/294309 (owner: 10Halfak) [17:01:35] (03CR) 10Yuvipanda: [C: 032 V: 032] icinga: Adds HTTP follow to check_ores_workers [puppet] - 10https://gerrit.wikimedia.org/r/294309 (owner: 10Halfak) [17:01:49] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 4.113 second response time [17:05:10] RECOVERY - check_puppetrun on pay-lvs2001 is OK: OK: Puppet is currently enabled, last run 91 seconds ago with 0 failures [17:05:10] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [17:05:10] PROBLEM - check_puppetrun on payments2001 is CRITICAL: CRITICAL: puppet fail [17:05:38] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2380267 (10mark) >>! In T131775#2219177, @RobH wrote: > No need to ping Papaul, he doesn't have any involvement in #hardware-requests. (Its primarily myself, a... [17:09:50] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2380275 (10jcrespo) BTW, the API is definitely faster, one just need to use it efficiently: ``` $ time curl 'https://en.wikipedia.org/w/api.php?action=query&prop=r... 
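jcrespo's timing example above (truncated by the log) rests on batching: action=raw costs one request per page, while the query API returns current revision text for up to 50 titles per call. An illustrative request in the same spirit — the titles are placeholders, not the elided command:

```
time curl -s 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Foo|Bar|Baz' >/dev/null
```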
[17:10:10] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [17:10:10] RECOVERY - check_puppetrun on payments2001 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [17:15:11] PROBLEM - check_puppetrun on pay-lvs2002 is CRITICAL: CRITICAL: puppet fail [17:18:59] (03PS3) 10Muehlenhoff: Define ferm service dynamicproxy-api-http in role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/293733 [17:20:03] (03CR) 10Yuvipanda: [C: 04-1] "Should also be in the other places dynamicproxy is used (somewhere in tools I think)" [puppet] - 10https://gerrit.wikimedia.org/r/293733 (owner: 10Muehlenhoff) [17:20:10] RECOVERY - check_puppetrun on pay-lvs2002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:22:22] !log Run initSiteStats.php for arcwiki and htwiki (T137827) [17:22:22] T137827: Update statistics count on htwiki and arcwiki - https://phabricator.wikimedia.org/T137827 [17:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:23:36] 06Operations, 10ops-codfw: install2001 hardware troubles - https://phabricator.wikimedia.org/T137647#2380377 (10Papaul) @faidon the server is up and will leave the task open until the end of the week and see if the BIOS change fixed the problem. [17:23:45] (03CR) 10Jhobs: "Yeah, this was Adam. I mean, this //looks// fine to me, but I'm really not the person to ask." [puppet] - 10https://gerrit.wikimedia.org/r/294052 (owner: 10BBlack) [17:24:11] RECOVERY - Host install2001 is UP: PING OK - Packet loss = 0%, RTA = 38.58 ms [17:25:43] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2380387 (10mmodell) @fgiunchedi: Debian already has packages for phabricator, libphutil and arcanist but I need them to be... [17:27:29] (03CR) 10Yuvipanda: [C: 031] "Never -1 early in the day, it is only the API" [puppet] - 10https://gerrit.wikimedia.org/r/293733 (owner: 10Muehlenhoff) [17:29:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Define ferm service dynamicproxy-api-http in role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/293733 (owner: 10Muehlenhoff) [17:32:45] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2380437 (10MoritzMuehlenhoff) [17:32:47] 06Operations: cronspam from argon - apache2 logrotate - https://phabricator.wikimedia.org/T132896#2380434 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff This doesn't affect kraz (it uses jessie and not precise, so apparmor is not used) and argon is gone, closing the bug. 
[17:33:55] (03PS1) 10Yuvipanda: labs: Disable diamond for deployment-prep too [puppet] - 10https://gerrit.wikimedia.org/r/294355 (https://phabricator.wikimedia.org/T137753) [17:34:22] (03PS2) 10Yuvipanda: labs: Disable diamond for deployment-prep too [puppet] - 10https://gerrit.wikimedia.org/r/294355 (https://phabricator.wikimedia.org/T137753) [17:34:42] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Disable diamond for deployment-prep too [puppet] - 10https://gerrit.wikimedia.org/r/294355 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [17:36:41] PROBLEM - Swift HTTP backend on ms-fe3002 is CRITICAL: Connection refused [17:37:01] PROBLEM - Swift HTTP frontend on ms-fe3002 is CRITICAL: Connection refused [17:51:04] (03PS1) 10Thcipriani: Deploy Graphoid with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/294357 [17:51:49] (03CR) 10Yurik: [C: 031] Deploy Graphoid with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/294357 (owner: 10Thcipriani) [18:01:40] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#2380514 (10Cmjohnson) [18:01:42] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Update tag and racktables for holmium: rename to labservices1002. - https://phabricator.wikimedia.org/T119533#2380511 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson This has been completed..forgot to resolve. Thanks! [18:02:53] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#2380517 (10Andrew) 05Open>03Resolved [18:09:00] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:11:00] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 5.089 second response time [18:13:29] (03PS1) 10Mholloway: Oops: restore :ZeroRatedMobileAccess check [puppet] - 10https://gerrit.wikimedia.org/r/294361 [18:24:44] 07Blocked-on-Operations, 06Operations, 10Monitoring, 06Services: Update restbase catchpoint metric - https://phabricator.wikimedia.org/T137181#2380621 (10GWicke) [18:26:59] (03CR) 10BBlack: [C: 032] Oops: restore :ZeroRatedMobileAccess check [puppet] - 10https://gerrit.wikimedia.org/r/294361 (owner: 10Mholloway) [18:27:20] 06Operations, 07Puppet, 13Patch-For-Review: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2380627 (10mmodell) Like @bblack, my main issue with alignment is the way puppet-lint forces me to re-indent a whole section just to add or remove one line. For one thing, I wou... 
[18:31:27] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2380633 (10RobH) a:05mark>03RobH [18:32:29] (03PS1) 10Bartosz Dziewoński: Set $wgAbuseFilterConditionLimit = 2000 for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294363 (https://phabricator.wikimedia.org/T132048) [18:33:43] 07Blocked-on-Operations, 06Operations, 10Monitoring, 06Services: Update restbase catchpoint metric - https://phabricator.wikimedia.org/T137181#2380639 (10GWicke) p:05Triage>03High [18:39:03] (03PS3) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [18:43:52] 06Operations, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2380670 (10RobH) [18:44:19] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2178301 (10RobH) 05Open>03Resolved WMF6405 is allocated for this use. T137838 has been created for the setup/deployment. [18:47:52] (03CR) 10Steinsplitter: [C: 031] Set $wgAbuseFilterConditionLimit = 2000 for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294363 (https://phabricator.wikimedia.org/T132048) (owner: 10Bartosz Dziewoński) [18:50:52] (03PS3) 10Chad: Group0 to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294336 (https://phabricator.wikimedia.org/T136971) (owner: 10Hashar) [18:53:11] !log demon@tin Purged l10n cache for 1.28.0-wmf.1 [18:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:03] !log demon@tin Purged l10n cache for 1.28.0-wmf.2 [18:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:15] !log demon@tin Purged l10n cache for 1.28.0-wmf.3 [18:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:32] !log demon@tin Purged l10n cache for 1.28.0-wmf.4 [18:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:23] !log demon@tin Purged l10n cache for 1.27.0-wmf.23 [18:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:30] RECOVERY - test icmp reachability to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 389 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [18:56:42] * greg-g peers over at T130317 [18:56:53] stashbot: you there? [18:57:03] or are /me's excluded? T130317 [18:57:03] T130317: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317 [19:00:04] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160614T1900). Please do the needful. [19:00:46] Hang on to your pants folks! 
[19:00:47] :) [19:00:56] (03CR) 10Chad: [C: 032] Group0 to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294336 (https://phabricator.wikimedia.org/T136971) (owner: 10Hashar) [19:01:10] * greg-g is wearing shorts [19:01:34] (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294336 (https://phabricator.wikimedia.org/T136971) (owner: 10Hashar) [19:03:44] !log demon@tin Started scap: group0 to 1.28.0-wmf.6 [19:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:40] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 690 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5213936 keys - replication_delay is 690 [19:11:51] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5185054 keys - replication_delay is 0 [19:29:17] * aude waves [19:29:56] ostriches: are you running scap yet / already? [19:30:02] Yep [19:30:09] Almost done [19:30:20] * aude wonders if my submodule update got in? [19:30:27] !log demon@tin Finished scap: group0 to 1.28.0-wmf.6 (duration: 26m 43s) [19:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:32] aude: Which extension? [19:30:40] wikidata [19:30:54] * aude looks [19:31:09] Nope, looks like [19:31:37] possibly it didn't go in automatically [19:31:45] CentralAuth is the only one I see on wmf.6 [19:31:51] in which case i would like https://gerrit.wikimedia.org/r/#/c/294338/ [19:32:00] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [19:32:06] (03PS1) 10Yuvipanda: labs: remove shinken monitoring for tools / deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/294366 (https://phabricator.wikimedia.org/T137753) [19:32:13] the amount of code changed isn't that much so maybe running scap again wouldn't take too long [19:32:23] sorry.... [19:32:30] (03PS2) 10Yuvipanda: labs: remove shinken monitoring for tools / deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/294366 (https://phabricator.wikimedia.org/T137753) [19:32:31] Okie dokie :) [19:32:36] Merging [19:32:47] thanks [19:34:05] (03PS2) 10Gehel: Enable kartotherian service check for all maps servers [puppet] - 10https://gerrit.wikimedia.org/r/294333 (https://phabricator.wikimedia.org/T137617) [19:38:31] (03CR) 10Yurik: [C: 031] Enable kartotherian service check for all maps servers [puppet] - 10https://gerrit.wikimedia.org/r/294333 (https://phabricator.wikimedia.org/T137617) (owner: 10Gehel) [19:43:54] !log demon@tin Started scap: wikidata submodule update for wmf.6 [19:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:07] aude: ^^^ [19:45:35] thanks [19:46:08] (03CR) 10Gehel: [C: 032] Enable kartotherian service check for all maps servers [puppet] - 10https://gerrit.wikimedia.org/r/294333 (https://phabricator.wikimedia.org/T137617) (owner: 10Gehel) [19:51:32] any lvm masters out there? robh? [19:51:51] s/lvm/lvs [19:52:09] lvm not a master but know it more than lvs! whats up? [19:52:29] my lvs experience recently is conftool to pool and depool from pybal/lvs [19:52:53] robh: I'm trying to find someone to watch my back while I send new maps servers to VLS [19:52:59] s/VLS/LVS/ [19:53:32] I am likely non-ideal, if it triggers some lvs critical I wouldn't know what to do. 
other than the most basic of things [19:53:49] robh: I think I know what I'm doing, but I'd like to know there is someone out there to check me... [19:54:30] robh: can we go over the change together? It should be fine, I'd just like another pair of eyes... [19:54:34] https://gerrit.wikimedia.org/r/#/c/294068/ [19:55:07] ok good, its at least in the area ive recently poked about so its not completely new to me, heh [19:55:19] robh: that's a good start! [19:55:21] conftool defaults a host to depooled right? [19:55:36] robh: yes, I checked that with bblack [19:55:38] so even adding these shouldn't route traffic until you trigger them to pooled [19:55:44] cool, this seems reasonable to me [19:55:47] i can +1 [19:56:16] robh: It is already +1 by bblack. I just want to make sure I know what to do after the merge... [19:56:29] (03CR) 10RobH: [C: 031] "chatted in irc, this seems sane as conftool doesn't auto pool these. (so even if they aren't ready for service, they shouldn't cause issu" [puppet] - 10https://gerrit.wikimedia.org/r/294068 (https://phabricator.wikimedia.org/T137620) (owner: 10Gehel) [19:56:35] ahh [19:57:48] whats the active lvs for codfw maps? [19:58:55] robh: that was my first question. Documentation says to find it in puppet/modules/lvs/manifests/configuration.pp, but I'm not sure how to read that [19:59:34] high-traffic1, high-traffic2 or low-traffic? My guess is low-traffic, but I'm unsure... [20:01:38] Oh, I see, there is some hiera as well... so servers are lvs2002 and lvs2005 [20:02:07] (03PS3) 10Eranroz: Adding support for some common imports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292883 (https://phabricator.wikimedia.org/T137074) [20:02:57] so the steps dont mention stopping puppet on the primary lvs [20:02:58] robh: check me, deployment goes: 1) disable puppet on lvs200(2|5), 2) run puppet on 2005, 3) restart pybal 4) check it looks ok [20:03:04] but yeah, i think you should [20:03:16] I'll talk over here [20:03:21] indeed, otherwise the directions for deploying a change, while outdated somewhat, seem right [20:03:23] just merge the change and run puppet-merge [20:03:28] (on palladium) [20:03:47] this change doesn't involve puppet runs on the LVS machines, or any restart of pybal [20:04:12] so, much easier than I thought! [20:04:14] wont it restart pybal to add them into the config? [20:04:18] restarting pybal is only necessary when defining a whole new service (or deleting one) [20:04:24] bblack, robh: thanks a lot! [20:04:27] oh [20:04:31] and nothing automatically restarts pybal. that only happens manually, and it's always dangerous [20:04:47] bblack: i thought it had to refresh the service to see new entries as well as new services, my bad, thank you for explaining [20:05:43] I'll be back soon to do the actual deployment... [20:05:51] the update to the list of the service's hosts in the commit, is processed by conftool-merge (which runs automatically at the tail of puppet-merge on palladium), which causes conftool-sync to create new etcd keys [20:06:11] and pybal reads the updates from etcd directly in realtime [20:06:21] (realtime-ish, anyways) [20:07:10] when creating new entries, conftool stuff defaults to pooled=no, though, so it won't actually cause IPVS traffic to route to them yet [20:07:18] until someone manually does confctl pooled=yes on the new hosts [20:07:48] dapatrick: do you have uncommitted math/kartographer changes?
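bblack's explanation, collected into commands. These mirror the ones actually used later in this log, run from the config master (palladium); the LVS service address is the kartotherian one checked below:

```
# puppet-merge chains into conftool-merge, which creates the new etcd
# keys with pooled=no; nothing routes until hosts are pooled by hand:
confctl select service=kartotherian get
confctl select name=maps2001.codfw.wmnet set/pooled=yes
# then confirm on the LVS box that the realserver showed up:
ipvsadm -Lnt 10.2.1.13:6533 | grep Route
```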
[20:08:07] look like maybe security related [20:08:07] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2380884 (10Papaul) @fgiunchedi the system is out of warrant and have to 2TB SAS disk on site will have to check with @ Robh to see if i need to open a procurement ticket for a 2TB SAS disk 7.2K... [20:08:07] AaronSchulz: Nope. [20:09:14] dapatrick: I see the commits to the extensions but the submodule pointer is stale [20:09:46] !log demon@tin Finished scap: wikidata submodule update for wmf.6 (duration: 25m 51s) [20:09:46] papaul: im not sure what you mean [20:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:53] on that last update [20:10:13] You have no spares for that disk, and its out of warranty? Or that you have 3 spares and can use one? [20:10:22] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] AaronSchulz: Ask MaxSem. [20:10:47] (03PS4) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [20:12:12] AaronSchulz, lemme just commit it [20:12:21] yeah, makes sense [20:14:32] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 1 failures [20:14:54] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2380904 (10RobH) @papaul: It isn't entirely clear to me what you mean. Can you re-clarify your statement/questions for clarity? I think you are asking if you need to create a #procurement S4... [20:16:23] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 1.207 second response time [20:20:07] (03PS1) 10BBlack: un-anchor regexes for select [software/conftool] - 10https://gerrit.wikimedia.org/r/294371 [20:21:06] (03CR) 10jenkins-bot: [V: 04-1] un-anchor regexes for select [software/conftool] - 10https://gerrit.wikimedia.org/r/294371 (owner: 10BBlack) [20:22:27] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2380942 (10greg) (All blockers are resolved) [20:24:17] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2380949 (10BBlack) Is the node.js notification service already running on iridium? Do we need some matching config in public DNS + private phab so that it knows its... [20:26:33] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:27:14] MaxSem: did you commit yet? [20:27:26] waiting for merge [20:29:17] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2380963 (10BBlack) (or, reading the docs, do we want to map phab.wm.o/ws/ to :22280? 
either way, it doesn't seem configured at all on the iridium side yet) [20:29:44] AaronSchulz, doned [20:33:37] (03PS20) 10Andrew Bogott: MySQL backend for storing roles / hiera data for labs [puppet] - 10https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [20:34:34] (03PS3) 10Yuvipanda: labs: remove shinken monitoring for tools / deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/294366 (https://phabricator.wikimedia.org/T137753) [20:34:44] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: remove shinken monitoring for tools / deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/294366 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [20:36:12] (03PS21) 10Andrew Bogott: MySQL backend for storing roles / hiera data for labs [puppet] - 10https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [20:36:34] MaxSem: looks the same :/ [20:37:07] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2380993 (10mmodell) @bblack: not set up on iridium because I wasn't entirely clear when/if it would become possible. | do we want to map phab.wm.o/ws/ to :22280 Ye... [20:37:55] (03CR) 10Andrew Bogott: [C: 032] MySQL backend for storing roles / hiera data for labs [puppet] - 10https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [20:38:12] well I didn't touch prod AaronSchulz [20:38:27] * AaronSchulz is looking at git diff for wmf6 [20:38:51] I just made sure it's under version control [20:39:09] andrewbogott woo! [20:39:36] yuvipanda: there are db creation/setup steps needed right? [20:40:05] (03PS1) 10Yuvipanda: labs: Don't have shinken do basic instance checks [puppet] - 10https://gerrit.wikimedia.org/r/294375 (https://phabricator.wikimedia.org/T137753) [20:40:07] yup [20:40:22] (03PS2) 10Yuvipanda: labs: Don't have shinken do basic instance checks [puppet] - 10https://gerrit.wikimedia.org/r/294375 (https://phabricator.wikimedia.org/T137753) [20:40:32] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Don't have shinken do basic instance checks [puppet] - 10https://gerrit.wikimedia.org/r/294375 (https://phabricator.wikimedia.org/T137753) (owner: 10Yuvipanda) [20:42:45] (03PS22) 10Andrew Bogott: MySQL backend for storing roles / hiera data for labs [puppet] - 10https://gerrit.wikimedia.org/r/285014 (https://phabricator.wikimedia.org/T133412) (owner: 10Yuvipanda) [20:46:29] !log adding new maps servers to LVS [20:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:47:06] (03PS2) 10Gehel: Add new maps servers to LVS [puppet] - 10https://gerrit.wikimedia.org/r/294068 (https://phabricator.wikimedia.org/T137620) [20:47:22] gehel, it also depools the old servers? [20:47:34] MaxSem: in a second time, but yes [20:48:20] MaxSem: plan is: add new servers to config, pool them one by one, check if all is good, depool old servers one by one [20:48:44] hmm, why so many steps? [20:49:30] because we're being careful [20:49:54] the config change defines new possible backend servers, but doesn't pool them. pool+depool of individual backend servers is commandline (not puppet commit). 
root@palladium:~# confctl select service=kartotherian get [20:50:28] {"maps-test2001.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=maps,service=kartotherian"} [20:50:31] {"maps-test2002.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=maps,service=kartotherian"} [20:50:34] {"maps-test2003.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=maps,service=kartotherian"} [20:50:37] {"maps-test2004.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=maps,service=kartotherian"} [20:50:41] !log aaron@tin Synchronized php-1.28.0-wmf.6/includes: ca9068daffb49cc0cdfb84385a29aea34df155cd (duration: 01m 51s) [20:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:50] ^ that list will get expanded with 4x new entries set to pooled:no by the commit gehel's going to merge [20:51:40] bblack: thanks for watching my back! [20:52:43] (03PS3) 10Gehel: Add new maps servers to LVS [puppet] - 10https://gerrit.wikimedia.org/r/294068 (https://phabricator.wikimedia.org/T137620) [20:54:22] (03CR) 10Gehel: [C: 032] Add new maps servers to LVS [puppet] - 10https://gerrit.wikimedia.org/r/294068 (https://phabricator.wikimedia.org/T137620) (owner: 10Gehel) [20:54:58] gehel: looks good [20:55:09] yep... [20:55:38] !log pooling maps2001 (new map server) - T137620 [20:55:39] T137620: Send traffic to new maps200? servers - https://phabricator.wikimedia.org/T137620 [20:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:01] !log gehel@palladium conftool action : set/pooled=yes; selector: maps2001.codfw.wmnet (tags: ['dc=codfw', 'cluster=maps', 'service=kartotherian']) [20:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:45] yeah conftool auto-logs for you now :) [20:59:03] I remember that now... [20:59:06] root@lvs2003:~# ipvsadm -Lnt 10.2.1.13:6533|grep Route
  -> 10.192.0.128:6533 Route 10 4 1
  -> 10.192.0.129:6533 Route 10 3 5
  -> 10.192.0.144:6533 Route 10 1 0
  -> 10.192.16.34:6533 Route 10 6 3
  -> 10.192.16.35:6533 Route 10 6 6
[20:59:13] [20:59:16] bad wrap on paste, but there are 5 servers pooled in LVS now [21:01:06] bblack: is there a way to pass additional info (phab ticket) to confctl logs? --help does not indicate that there is... [21:01:16] MaxSem: let me know if you see something strange on maps... [21:01:29] gehel: probably not [21:02:07] MaxSem: there are a lot of errors "Maximum zoom is 18", but those are also present on old maps servers... [21:02:12] twentyafterfour, dapatrick: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Security_patches should be updated for how to handle submodule pointers? Should someone just make a local wmf-x core commit with the new hash? Does that have to go in /patches too? [21:02:42] gehel, any referrer on these? [21:03:02] AaronSchulz: when I do the train I make a local commit with the submodule's new hash [21:03:04] MaxSem: I don't see that in the kartotherian logs... [21:03:05] bblack: we're seeing some strange intermittent behavior on a portion of API requests with the error message "via cp1052 frontend, Varnish XID 1317010381
Error: 403, Insecure Request Forbidden - use HTTPS". Are you aware of this? [21:03:38] you don't have to add a patch in /srv/patches for the submodule bumps though [21:03:43] bearND: yes, HTTP is starting to be disabled, all traffic now needs to switch to HTTPS [21:04:02] bearND: yeah, not unexpected [21:04:03] AaronSchulz: I'm not entirely sure if everyone follows the same practices here. It's been inconsistent afaik [21:04:21] it's approximately 10% of insecure requests that are hitting that error [21:04:27] you definitely are, if you see that message :) [21:04:40] twentyafterfour: right now, no one can rebase without stashing or something (which also restored some other crap I had to clear out with HEAD) [21:05:03] I'll just make a local commit. Should it have SECURITY in the message? [21:05:12] AaronSchulz: yes local commit [21:05:16] bearND: what is it that's making the requests that get the error? [21:05:35] I always make a local commit that says '[security] local submodule patches - do not push' [21:05:39] AaronSchulz: ^ [21:05:53] bblack: I was running Mobile Content Service tests. mdholloway saw it, too. Not sure what he was doing [21:05:58] MaxSem: is there an easy way to add referrer to kartotherian logs? MDC or something similar? [21:05:59] gehel, I see a regression in new tiles [21:06:03] indeed the documentation is way outdated I'll try to fix that [21:06:28] MaxSem: how bad? do I depool maps2001? [21:06:37] bearND: these requests were from client-side code? or requests made by the server itself on a host in our infra? [21:06:56] bblack: no, this was originated from my dev box [21:07:01] gehel, try zooming in on https://maps.wikimedia.org/#7/44.351/-113.313 [21:07:14] wait, is there a cache mishmash? [21:07:15] your dev box running an emulation of client-side code, or... ? [21:07:36] MaxSem: what should I be looking for? [21:07:46] strange squares appearing [21:07:59] I thought we expected new data with the new servers, and thus new tiles/borders, etc? [21:07:59] probably it's a mix of tiles from old and new servers [21:08:05] bblack: my dev box running the Mobile Content Service tests. Some of them access Parsoid via RESTBase, other MW API [21:08:14] MaxSem: yep, I see them. [21:08:18] bearND: bblack: we'll probably need to work with mobrovac on this.
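The submodule-pointer practice twentyafterfour describes, sketched as commands; the branch, extension name, and hash are placeholders, not the actual patch:

```
cd /srv/mediawiki-staging/php-1.28.0-wmf.6
# point the submodule checkout at the patched commit...
cd extensions/SomeExtension
git fetch origin && git checkout PATCHED_HASH
cd ../..
# ...then record the new pointer in a clearly-marked local-only commit:
git add extensions/SomeExtension
git commit -m "[security] local submodule patches - do not push"
```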
i tried changing our dev config to use https once before but i vaguely remember that wouldn't work because of some internal restbase configuration [21:08:25] that's how things are configured right now, there are 4 old servers + 1 new server pooled [21:08:27] twentyafterfour: ok [21:08:28] gehel, I say finish the transition and then flush varnishes [21:08:43] MaxSem: ok, pooling the other 2 maps servers [21:08:50] !log gehel@palladium conftool action : set/pooled=yes; selector: maps2002.codfw.wmnet (tags: ['dc=codfw', 'cluster=maps', 'service=kartotherian']) [21:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:55] !log gehel@palladium conftool action : set/pooled=yes; selector: maps2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=maps', 'service=kartotherian']) [21:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:01] !log gehel@palladium conftool action : set/pooled=yes; selector: maps2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=maps', 'service=kartotherian']) [21:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:07] mdholloway: I could maybe help, but it's hard for me to understand the abstractions here since I don't work on this stuff normally. what piece of code is actually making the request that gets the 403? [21:09:27] ^ now that I think about it, there is probably a way to do that in only one command... [21:09:40] !log gehel@palladium conftool action : set/pooled=no; selector: maps-test2001.codfw.wmnet (tags: ['dc=codfw', 'cluster=maps', 'service=kartotherian']) [21:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:45] gehel: confctl select name='maps2.*' set/pooled=yes [21:10:12] try it for test depool :) [21:10:22] with name='maps-test.*' [21:10:46] !log gehel@palladium conftool action : set/pooled=no; selector: name=maps-test2* [21:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:15] bblack: mdholloway is talking about https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/config.dev.yaml#L67+L74 [21:11:35] bblack: when we run from our dev boxes it uses the dev config [21:11:47] bblack: bearND: it's configured in a template here: https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/config.dev.yaml#L72-L77 -- the reason we are using http here is, i believe, that we needed non-https support for labs (at least at one time) [21:12:39] bblack: how do I clear cache for maps? [21:12:48] mdholloway: ok so a few things: (1) ignoring labs, shouldn't this be hitting MW-API and RB private entrypoints, not public? e.g. shouldn't it be using restbase.svc.eqiad.wmnet and api.svc.eqiad.wmnet? [21:13:16] gehel: you don't really, we don't have a standard command for that, because it's usually dangerous for production servers :) [21:13:46] bblack, but not in our case ;) [21:13:52] bblack: makes sense... I was checking how much traffic that is...
[21:13:58] MaxSem: only because so little traffic :) [21:14:01] gehel, old instructions for mobile: https://wikitech.wikimedia.org/wiki/MobileFrontend#Flushing_the_cache [21:14:15] please don't follow those :P [21:14:47] (we should delete that really) [21:14:56] bblack: bearND: in production, it does (that's configured in puppet, but reflected in config.prod.yaml (https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/config.prod.yaml) -- can/should we be for dev purposes as well? [21:15:01] there is no mobile cluster, and doing that on the text cluster would be awful [21:15:16] mdholloway: if dev is hitting production hostnames, yes. [21:15:38] mdholloway: well maybe I don't understand your question, hang on [21:16:22] bblack: but we have a dedicated varnish cluster for maps now... [21:16:45] gehel: either way, there's no point doing a ban on everything.... [21:17:14] bblack, what are you suggesting? [21:17:19] (and that ban wouldn't be reliable anyways without stepping through the DC tiers...) [21:17:26] just restart and wipe the varnishds [21:17:34] :O [21:18:09] bblack: nuke them! [21:18:49] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:19:02] bblack: that does sound more scary, even if I do know it makes more sense... [21:19:39] it's already running, it will take a few minutes [21:19:50] it's nice to clean up the persistent object store once in a blue moon when we can anyways [21:20:24] bblack: I was still trying to see if we have a grain to identify cache-maps [21:20:41] bblack: for my education, how did you restart and wipe them? [21:20:54] salted cmd.run 'service varnish stop; rm -f /srv/sd*/varnish*; service varnish start' [21:21:11] but stepping through them 1 host at a time, and in datacenter-tiering order [21:21:22] so in the current maps case, codfw, then eqiad, then ulsfo+esams [21:21:39] with "-b 1" on each DC's salt run to finish one before starting the next... [21:21:56] bblack: isn't the memory cache in tmpfs? [21:22:02] and then after all the backends are wiped, 'service varnish-frontend restart' to get the frontend caches everywhere (order doesn't matter anymore) [21:22:12] it's in memory, but not in tmpfs :) [21:22:32] ok, I was looking for a tmpfs and not finding it... [21:22:58] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 3.079 second response time [21:23:01] bblack: thanks a lot for the help btw... [21:23:16] everything's wiped now [21:23:28] (well things came back into cache during the process of the wipe, but only from the new servers) [21:24:12] MaxSem: I still see those strange squares... [21:24:33] browser cache? [21:25:12] bblack: nope, anonymous session... [21:25:12] try anon window [21:25:44] mdholloway: it depends... if {domain} in dev ends up being like 'en.wikipedia.org', then yes use https:// in dev config. If it ends up being like 'en.wikipedia.beta.wmflabs.org', then no. [21:25:48] MaxSem: I did (we might not be talking about the same squares... I'm not even sure how those maps should look) [21:26:08] AaronSchulz, I'll ride with what twentyafterfour said; I've not dealt with submodule pointers during deployment yet. [21:26:17] yurik, ^^^ [21:27:08] MaxSem: does it look good to you? [21:27:28] nope [21:27:39] MaxSem: ok, let's roll back... [21:27:44] agree [21:27:53] I still see strange squares too [21:28:00] I checked, old tiles are just fine [21:28:14] are they definitely new tiles?
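bblack's wipe procedure, expanded into the full sequence he describes — codfw first, then eqiad, then ulsfo/esams, one host at a time per DC. The grain names used for targeting are an assumption:

```
# backends: one DC at a time in tiering order, batched one host at a time
for dc in codfw eqiad ulsfo esams; do
  salt -b 1 -C "G@site:${dc} and G@cluster:cache_maps" cmd.run \
    'service varnish stop; rm -f /srv/sd*/varnish*; service varnish start'
done
# frontends afterwards; ordering no longer matters at this layer
salt -C 'G@cluster:cache_maps' cmd.run 'service varnish-frontend restart'
```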
[21:28:35] !log sending traffic back to old maps servers (T137620) [21:28:36] T137620: Send traffic to new maps200? servers - https://phabricator.wikimedia.org/T137620 [21:28:37] !log gehel@palladium conftool action : set/pooled=yes; selector: name=maps-test2* [21:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:50] old tiles definitely don't have this crap, bblack [21:29:31] and getting the maps directly from one of the new server has the same issue, so it's not a caching or mixed tiles issue (I should have checked first) [21:30:14] yeah the tile Age: headers look new too [21:30:39] ok [21:31:38] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [21:32:12] bblack: I probably did something wrong when depooling maps-test servers. [21:32:23] why? [21:32:43] I just ran "confctl select name='maps-test2*' set/pooled=no", but checking with "confctl select service=kartotherian get" I still see those servers pooled [21:33:02] heh [21:33:06] it's not a glob, it's a regex [21:33:18] so yeah, the old servers were never depooled [21:33:23] ok, my bad... [21:33:25] try again [21:33:32] let's try to depool them for real [21:33:50] !log gehel@palladium conftool action : set/pooled=no; selector: name=maps-test2.* [21:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:01] now they're gone, checked on LVS too [21:34:04] it asks for confirmation this time... [21:34:21] yeah it always does if you take action on multiple servers [21:34:31] or maybe it's if you take action on >half of servers [21:34:32] confctl too (I should have checked, sorry for the first try) [21:34:38] rewipe caches? [21:34:44] bblack: please! [21:35:36] bblack: can you do ordering with salt? Or do you have a script around salt that wipes those servers in the correct order? [21:37:36] gehel: I do the ordering manually, so 4x separate salt commands in this case [21:37:57] the ordering isn't technically fixed anyways, it can vary depending on cluster and whether we're using codfw for things, etc [21:38:05] in any case, all wiped again [21:38:17] no more blocky-looking stuff [21:40:02] bblack: bearND: right, that much makes sense and as far as that goes i think we can fix it up quickly. back to your point, though, about the private entrypoints, i know how to ssh through a bastion into *.eqiad.wmnet but not how I'd set myself up to hit entrypoints there from my local machine with, say, cURL or a browser [21:40:30] MaxSem: I still do see blocky-looking stuff, but I'm not sure I'm looking at the same thing [21:41:06] gehel, land allocations in the US are weird [21:41:09] I don't see it, I was looking again at the same area maxsem pointed out, which looked blocky for me before [21:41:22] you pretty often get a legit 10x10 miles checkerboard [21:41:25] mdholloway: bblack: I copied the config.yaml portion from prod to dev. But that doesn't seem to work. Now I get 504s instead of occasional 403s. [21:42:31] mdholloway: from your local machine, do the same as a labs instance (https://{domain} if it's prod domains, http://{domain} if it's .beta.wmflabs.org), don't use the private stuff? [21:42:40] MaxSem: so it looks good to you? Those squares looked odd, but looking at Europe I did not see any ... 
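The depool gotcha above, side by side: confctl name selectors are regexes (and, consistent with bblack's "un-anchor regexes for select" conftool patch from earlier in the log, still anchored ones at this point), not shell globs:

```
confctl select name='maps-test2*' set/pooled=no
#  as an anchored regex: "maps-test" plus zero or more "2"s,
#  so it matches no real hostname and depools nothing
confctl select name='maps-test2.*' set/pooled=no
#  "maps-test2" followed by anything -- matches maps-test2001..2004
```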
[21:42:49] mdholloway: the prod config won't work in dev, dev can't talk to private production hostnames/networks [21:42:55] sorry that last one was for: [21:42:59] gehel, I'm getting more people to look at the map [21:43:03] bearND: the prod config won't work in dev, dev can't talk to private production hostnames/networks [21:43:40] MaxSem: in any case, I'll keep the old maps server in config for some time, just in case we need to switch back for whatever reason... [21:44:01] alrighty [21:44:02] bblack: mdholloway : I guess we're stuck then [21:44:08] stuck with what? [21:44:15] thanks a bunch gehel and bblack! [21:44:18] bblack: ok, yeah, that's why i was surprised you suggested it (i probably wasn't clear by "for development purposes") [21:44:24] !log aaron@tin Synchronized php-1.28.0-wmf.6/includes: 7898fd2fa969342a5cc30df6a5757f4642cd6118 (duration: 01m 12s) [21:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:44:33] MaxSem: And thanks for checking! [21:45:05] bblack: our tests fail [21:45:42] mdholloway: so recapping: production uses http://foo.svc.eqiad.wmnet in config, labs and personal dev use either https://{domain} for real production domain or http://{domain} for beta.wmflabs.org domain [21:45:46] bblack: thanks a lot for your help and patience. I always appreciate it! [21:45:50] bearND: with the above, they should work [21:47:19] I would've thought labs/dev would use the beta.wmflabs.org {domain} values, but I'm guessing from the 403s that it's using the production public hostnames (which is ok from this perspective about HTTPS, as long as you use https://) [21:47:57] bearND: as a stopgap fix, what do you think about changing the dev config to use https again, and open up a task to talk with the services team about how to handle beta labs? [21:48:36] is {domain} beta.wmflabs.org in these cases, or not? [21:48:45] well, something.beta.wmflabs.org [21:49:21] bearND: or we could do some regex-matching to s/https/http for labs if it's a labs domain, though it's not the most elegant [21:49:36] oh, the tests use both beta and prod public domainnames, mixed? [21:50:23] !log aaron@tin Synchronized php-1.28.0-wmf.6/resources: 7898fd2fa969342a5cc30df6a5757f4642cd6118 (duration: 00m 28s) [21:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:50:54] (ideally beta should have https for its public hostnames, too, but it doesn't. that's the subject of a chain of long-outstanding tickets...) [21:52:06] bblack: yes. we usually use public domains for local development but need to handle labs also since the services team uses labs for some integration testing, as i understand it. (we had a task about this a while back: https://phabricator.wikimedia.org/T113542) /cc: bearND [21:52:08] bblack: yes, a couple of tests use beta labs, the majority uses production servers [21:52:53] (03PS1) 10RobH: phab2001 install module updates [puppet] - 10https://gerrit.wikimedia.org/r/294385 [21:53:52] (public prod hostnames, that is) [21:54:01] 06Operations, 06Research-and-Data-Backlog, 10Research-management, 06Revision-Scoring-As-A-Service, and 3 others: [Epic] Deploy Revscoring/ORES service in Prod - https://phabricator.wikimedia.org/T106867#2381168 (10Ladsgroup) 05Open>03Resolved [21:55:58] someday, we'll reach a place where all internal and external and beta URLs are all https:// always.
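A hypothetical shape for the dev-config stopgap being discussed (the request-template keys here are illustrative, not the service's actual schema): https:// for real production {domain} values, plain http:// only for the HTTPS-less *.beta.wmflabs.org:

```
# config.dev.yaml (sketch)
mwapi_req:
  method: post
  uri: https://{domain}/w/api.php
restbase_req:
  method: '{method}'
  uri: https://{domain}/api/rest_v1/{+path}
```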
there are several open tickets covering all of that, and a long way to go :) [21:56:33] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [21:57:11] (03PS1) 10RobH: setting/updating phab2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/294386 [21:59:41] (03CR) 10BBlack: [C: 031] "seems useful to me! We tend to use hiera() for similar patterns in prod, but there are cases where hiera() is annoying for this, too :)" [puppet] - 10https://gerrit.wikimedia.org/r/294331 (owner: 10Jgreen) [21:59:49] (03CR) 10RobH: [C: 032] setting/updating phab2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/294386 (owner: 10RobH) [21:59:59] 06Operations, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2381189 (10RobH) [22:04:21] 06Operations, 10Traffic, 07HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#2381200 (10BBlack) [22:04:25] 06Operations, 10Traffic, 07HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#2381195 (10BBlack) 05Open>03Resolved a:03BBlack This is done for all the reasonable cases we have direct control of. The external-ish ones are tracked in task T132521 and on wikitech at https://... [22:06:16] (03CR) 10RobH: [C: 032] phab2001 install module updates [puppet] - 10https://gerrit.wikimedia.org/r/294385 (owner: 10RobH) [22:06:43] so is there something up with the zuul testing? [22:06:49] it seems to be taking too long. [22:07:03] 13m, 22m, etc... [22:11:14] mdholloway: I'd be ok with your proposal for now, see https://gerrit.wikimedia.org/r/#/c/294389/ [22:11:47] 06Operations, 10Traffic: Remove referrer check from varnish for maps cluster - https://phabricator.wikimedia.org/T137848#2381204 (10Yurik) [22:12:06] 06Operations, 06Discovery, 10Kartotherian, 06Maps, and 2 others: Remove referrer check from varnish for maps cluster - https://phabricator.wikimedia.org/T137848#2381217 (10Yurik) [22:13:49] (03PS1) 10MaxSem: maps caches: remove referrer checks [puppet] - 10https://gerrit.wikimedia.org/r/294390 (https://phabricator.wikimedia.org/T137848) [22:14:05] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381225 (10BBlack) [22:14:07] 06Operations, 06Discovery, 10Kartotherian, 06Maps, and 3 others: Remove referrer check from varnish for maps cluster - https://phabricator.wikimedia.org/T137848#2381224 (10BBlack) [22:15:14] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes/deferred: 29863094805baed7a5fa493c99c87745ce041f49 (duration: 00m 27s) [22:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:15:31] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381231 (10MaxSem) [22:15:33] 06Operations, 06Discovery, 06Maps, 10Traffic, 13Patch-For-Review: Send traffic to new maps200? servers - https://phabricator.wikimedia.org/T137620#2381228 (10MaxSem) 05Open>03Resolved a:03MaxSem Was done by @Gehel. 
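One possible shape for the stopgap bearND and mdholloway converge on above (the actual patch behind gerrit 294389 is not shown in this log, so this helper is hypothetical): derive the URL scheme from the domain, so beta labs hosts get plain http and every other public domain gets https.

    # Hypothetical helper: beta labs has no HTTPS, while all other public
    # Wikimedia domains should be fetched over https.
    scheme_for_domain() {
        case "$1" in
            *.beta.wmflabs.org) echo http ;;
            *)                  echo https ;;
        esac
    }

    scheme_for_domain en.wikipedia.org               # prints "https"
    scheme_for_domain en.wikipedia.beta.wmflabs.org  # prints "http"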
[22:15:50] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2241419 (10MaxSem) a:03Gehel [22:16:19] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381234 (10Yurik) [22:16:21] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Enable specs on Katotherian service - https://phabricator.wikimedia.org/T137617#2381233 (10Yurik) 05Open>03Resolved [22:16:55] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381238 (10MaxSem) [22:16:57] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2381236 (10MaxSem) 05Open>03Resolved These servers are now serving traffic. [22:18:45] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381243 (10BBlack) Re the flurry of ticket updates here and in related places: 1. Installing and setting up eqiad doesn't have to block this, it can go under some other meta-task fo... [22:20:53] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381246 (10Yurik) @bblack, the T137617 was the monitoring one - it is now in icinga. Puppetization is a bit less cut and dry - I think everything has been scripted, but db install... [22:21:08] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 10Wikimedia-Logstash: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2381248 (10debt) p:05Triage>03Normal [22:23:11] 06Operations, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2381268 (10RobH) [22:24:06] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381269 (10BBlack) @Yurik - T137617 does detailed service monitoring on each node (and probably shouldn't page people). What we're lacking is the higher-level "is this service aliv... [22:24:21] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2381273 (10RobH) a:05RobH>03Papaul So this doesn't have a switch port labeled yet. Assigning this task to @papaul for the following steps: [] - update physical label and racktables... [22:24:38] 06Operations, 10ops-codfw, 10Phabricator: setup phab2001.codfw.wmnet (WMF6405) - https://phabricator.wikimedia.org/T137838#2381287 (10RobH) [22:25:24] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2381288 (10Yurik) @mobrovac could you comment on how you do that WRT other services? I thought you also use spec.yaml for that? [22:25:25] bearND: hey, sorry, i was working on one too when you pinged, i'll look at yours now [22:25:57] bblack, i was under the impression that all of services use spec.yaml ? 
[22:27:57] yurik: for the internal svc hostname monitoring, they're defined in https://github.com/wikimedia/operations-puppet/blob/production/modules/lvs/manifests/monitor_services.pp [22:28:56] bearND: yours looks fine to me too, i'll +2 unless you want to go with mine [22:29:02] bblack, thx, will see what i can do [22:29:05] bearND: i imagine we'll have to update before long anyway [22:30:43] yurik: there's also some (slightly different) monitoring usually configured in hieradata/common/lvs/configuration.yaml, the "icinga:" sub-keys you see on services there. and then there's the cache-level one for public-facing URLs also in the same file the same basic way. [22:32:09] 06Operations, 06Discovery, 06Maps: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2381299 (10Yurik) [22:32:24] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [22:32:50] yurik: what should we be using for an alive-check URL? [22:32:59] bblack, thx! could you add it to https://phabricator.wikimedia.org/T137851 ? [22:33:23] 06Operations, 06Discovery, 06Maps: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2381316 (10BBlack) there's also some (slightly different) monitoring usually configured in hieradata/common/lvs/configuration.yaml, the "icinga:" sub-keys you see on services there. and then there's... [22:33:26] bblack, we should simply use one of the tiles, e.g. the one we already use in the spec [22:34:15] where's the spec at? [22:34:57] bblack, /osm-intl/11/828/655.png -- https://github.com/kartotherian/kartotherian/blob/master/spec.yaml [22:35:21] that url does a more expensive tile generation - using fallback [22:35:33] maybe for quicker live test we should use /0/0/0.png [22:35:59] ok [22:36:06] or one with water, also at low zooms - it probably has the lowest execution costs [22:38:04] 0/0/0 seems pretty canonical and simple [22:40:30] bblack, it is, but it is pretty large on the back end, whereas a tiny water tile is both much smaller and much less CPU/DB intensive [22:40:30] https://maps.wikimedia.org/osm-intl/6/23/24.png [22:40:47] this is like 1KB tile [22:40:50] if not less [22:41:14] 06Operations, 06Discovery, 06Maps: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2381325 (10Yurik) The [[ https://github.com/kartotherian/kartotherian/blob/master/spec.yaml | spec.yaml ]] uses https://maps.wikimedia.org/osm-intl/11/828/655.png for node testing. This is a very sma... [22:41:25] copied my notes ^ [22:43:26] !log aaron@tin Synchronized php-1.28.0-wmf.6/includes/deferred: 0d038de1414c0b4faed1cc9882151e68d86d3b2d (duration: 00m 25s) [22:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:47] (03PS1) 10BBlack: kartotherian.svc.codfw monitoring [puppet] - 10https://gerrit.wikimedia.org/r/294396 (https://phabricator.wikimedia.org/T137851) [22:51:35] (03PS1) 10BBlack: maps.wm.o monitoring [puppet] - 10https://gerrit.wikimedia.org/r/294397 (https://phabricator.wikimedia.org/T137851) [22:52:42] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2381381 (10BBlack) note the first commit 294396 is codfw-only since eqiad isn't set up yet.
needs identical eqiad-stuff once eqiad exists [22:55:03] (03CR) 10Yurik: [C: 031] "URL looks good" [puppet] - 10https://gerrit.wikimedia.org/r/294397 (https://phabricator.wikimedia.org/T137851) (owner: 10BBlack) [22:56:59] (03CR) 10Yurik: [C: 031] "url/ports look good" [puppet] - 10https://gerrit.wikimedia.org/r/294396 (https://phabricator.wikimedia.org/T137851) (owner: 10BBlack) [23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160614T2300). Please do the needful. [23:00:04] matt_flaschen, MatmaRex, and kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:11] Present [23:00:16] hi. [23:00:22] you love those commas [23:00:24] you know you do [23:00:33] ori :> [23:01:06] ori, did you add them? I like the comma style: Oxford Comma 4 Life [23:01:17] yep! :) [23:03:25] who is doing the deploy? [23:03:32] I can if there is no one [23:04:31] OK, I'll do it [23:06:46] kaldari: are you here? [23:06:50] (03PS2) 10Ori.livneh: Set $wgAbuseFilterConditionLimit = 2000 for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294363 (https://phabricator.wikimedia.org/T132048) (owner: 10Bartosz Dziewoński) [23:07:04] (03CR) 10Ori.livneh: [C: 032] "I'll merge it but I don't have to like it! :P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294363 (https://phabricator.wikimedia.org/T132048) (owner: 10Bartosz Dziewoński) [23:08:01] (03Merged) 10jenkins-bot: Set $wgAbuseFilterConditionLimit = 2000 for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294363 (https://phabricator.wikimedia.org/T132048) (owner: 10Bartosz Dziewoński) [23:09:08] ori: I'm here [23:09:45] !log ori@tin Synchronized wmf-config/abusefilter.php: I4e5e4d227: Set $wgAbuseFilterConditionLimit = 2000 for commonswiki (T132048) (duration: 00m 28s) [23:09:46] T132048: Change $wgAbuseFilterConditionLimit for wikimedia commons - https://phabricator.wikimedia.org/T132048 [23:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:43] ori: thanks, seems fine. https://commons.wikimedia.org/wiki/Special:AbuseFilter is displaying the new limit [23:10:44] kaldari: should I cherry-pick https://gerrit.wikimedia.org/r/#/c/294284/ to wmf.6? [23:10:53] MatmaRex: ack, thanks [23:11:18] ori: oh yeah, I forgot to do that. Sorry! [23:11:27] no problem [23:12:53] kaldari: doesn't apply cleanly :/ could you do it, then? I still need to deploy matt_flaschen's change [23:13:36] I'll try [23:14:48] ori: it's been like a year since I've done this :) [23:15:22] !log ori@tin Synchronized php-1.28.0-wmf.6/extensions/Echo: If07369cb1: Allow the primary link to set all bundled notifications as read (T136368) (duration: 00m 34s) [23:15:23] T136368: Dynamic bundle: non-bundle_base notifications need a read timestamp - https://phabricator.wikimedia.org/T136368 [23:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:06] matt_flaschen: ^ [23:16:15] ori, thanks, testing now. [23:17:28] ori, looks good, thanks.
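For context on the "Synchronized wmf-config/..." lines above: a SWAT config deploy of this era roughly amounts to pulling the merged change onto the deployment host and syncing the touched file out to the cluster. The staging path and the exact scap entry point are assumptions from memory of the 2016 tooling, not something this log confirms:

    # On the deployment host (tin), once the change is merged in gerrit:
    cd /srv/mediawiki-staging   # assumed staging path
    git pull                    # pick up the merged wmf-config change
    # Sync a single file, with a log message referencing the task:
    scap sync-file wmf-config/abusefilter.php \
        'Set $wgAbuseFilterConditionLimit = 2000 for commonswiki (T132048)'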
[23:17:36] ack [23:21:00] (03PS4) 10Dereckson: Set import sources for he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292883 (https://phabricator.wikimedia.org/T137074) (owner: 10Eranroz) [23:21:08] (03PS5) 10Dereckson: Set import sources for he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292883 (https://phabricator.wikimedia.org/T137074) (owner: 10Eranroz) [23:21:57] Hi. May I add two config changes to the SWAT? 292883 ^ and another one for pt.wikinews: 293912 [23:22:22] sure, Dereckson [23:22:41] (03CR) 10BBlack: [C: 032] kartotherian.svc.codfw monitoring [puppet] - 10https://gerrit.wikimedia.org/r/294396 (https://phabricator.wikimedia.org/T137851) (owner: 10BBlack) [23:22:54] kaldari: don't worry about it; I cherry-picked it for you [23:23:12] (03CR) 10BBlack: [C: 032] maps.wm.o monitoring [puppet] - 10https://gerrit.wikimedia.org/r/294397 (https://phabricator.wikimedia.org/T137851) (owner: 10BBlack) [23:23:32] (03CR) 10Ori.livneh: [C: 032] Set import sources for he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292883 (https://phabricator.wikimedia.org/T137074) (owner: 10Eranroz) [23:23:36] ori: thanks, for some reason the way I used to do cherry-picks doesn't seem to work anymore :( [23:24:03] error: the requested upstream branch 'origin/wmf/1.28.0-wmf.6' does not exist [23:24:11] (03Merged) 10jenkins-bot: Set import sources for he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292883 (https://phabricator.wikimedia.org/T137074) (owner: 10Eranroz) [23:24:20] (03PS2) 10Ori.livneh: Set import sources for pt.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293912 (https://phabricator.wikimedia.org/T137633) (owner: 10Dereckson) [23:24:25] (03CR) 10Ori.livneh: [C: 032] Set import sources for pt.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293912 (https://phabricator.wikimedia.org/T137633) (owner: 10Dereckson) [23:25:07] (03Merged) 10jenkins-bot: Set import sources for pt.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293912 (https://phabricator.wikimedia.org/T137633) (owner: 10Dereckson) [23:25:44] kaldari: wmf/1.28.0-wmf.6 [23:25:51] kaldari: origin is the name of the remote, not the branch [23:25:55] Dereckson: both changes are staged on mw1017 -- would you like to verify? [23:26:06] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [23:28:06] Looks good to me through mwscript eval.php on mw1017, can't verify on web as that requires importer rights. [23:28:08] !log ori@tin Synchronized php-1.28.0-wmf.6/extensions/AntiSpoof: I2e407a3ac8: Revert "Make sure AntiSpoof mappings are mapping in the correct direction." (duration: 00m 27s) [23:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:28:41] that's good enough, I think [23:29:15] Dereckson: I was trying to do "git branch --track wmf/1.28.0-wmf.6 origin/wmf/1.28.0-wmf.6" like it says at https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Updating_the_deployment_branch. Is that out of date? 
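kaldari's error ("the requested upstream branch 'origin/wmf/1.28.0-wmf.6' does not exist") and Dereckson's fix in the next line come down to one missing step: the remote-tracking ref has to exist locally before --track can use it. This is standard git, following the wikitech recipe being quoted:

    # Fetch first so origin/wmf/1.28.0-wmf.6 exists locally:
    git fetch origin
    git branch --track wmf/1.28.0-wmf.6 origin/wmf/1.28.0-wmf.6
    git checkout wmf/1.28.0-wmf.6
    # Note: each submodule/extension checkout needs its own fetch too,
    # which is what bit kaldari in the extension dir below.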
[23:30:29] kaldari: try git fetch before to get the origin branch [23:30:40] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: Id800a9d35b: Set import sources for he.wikipedia (T137074) and If66f307a2e: Set import sources for pt.wikinews (T137633) (duration: 00m 27s) [23:30:42] T137633: Add import sources for ptwikinews - https://phabricator.wikimedia.org/T137633 [23:30:42] T137074: add ImportSources for hewiki - https://phabricator.wikimedia.org/T137074 [23:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:44] Dereckson: ^ [23:31:01] (thanks for taking care of those requests) [23:31:17] Dereckson: Duh! Thanks. I had done that in the core dir, but not in the extension dir :P [23:31:31] kaldari: your repo needs to know the branch exists on the remote repo to be able to track it [23:31:40] ori: looks good to me, thanks for the deploy [23:40:40] !log ori@tin Synchronized php-1.28.0-wmf.6/resources/src/mediawiki.action/mediawiki.action.edit.stash.js: Idfad8407c8e: Improve client-side edit stash change detection (duration: 00m 25s) [23:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:04] [23:45:07] I'm off, bye [23:46:07] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2381486 (10BBlack) Basic checks are in place with the above merged. I'm not sure about the contact-groups stuff on the kartotherian.svc check, it uses 'services-team', is that... [23:47:05] (03CR) 10Krinkle: [C: 04-1] git.wikimedia.org -> Diffusion redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [23:49:41] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: Generic error: utf8 codec cant decode byte 0x89 in position 0: invalid start byte [23:49:56] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2381488 (10BBlack) Heh, relatedly, check_wmf_service probably isn't the right thing to use in general, as it tries to parse the response and doesn't like PNG :) [23:50:22] karto alert above is not real. it was just turned on for the first time and it's not the right check :) [23:58:21] (03PS1) 10BBlack: Remove kartotherian from monitor_services [puppet] - 10https://gerrit.wikimedia.org/r/294406 (https://phabricator.wikimedia.org/T137851) [23:59:21] (03CR) 10BBlack: [C: 032 V: 032] Remove kartotherian from monitor_services [puppet] - 10https://gerrit.wikimedia.org/r/294406 (https://phabricator.wikimedia.org/T137851) (owner: 10BBlack) [23:59:51] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
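Looping back to the alive-check tile discussion further up (around 22:33-22:41): the cost trade-off between the canonical 0/0/0 tile and the tiny mostly-water tile can be eyeballed from the outside with standard curl; both tile URLs are the ones quoted in that exchange:

    # Compare payload size and total fetch time for the two candidate
    # alive-check tiles:
    for tile in 0/0/0 6/23/24; do
        curl -so /dev/null -w "$tile: %{size_download} bytes in %{time_total}s\n" \
            "https://maps.wikimedia.org/osm-intl/$tile.png"
    done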
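And on the Kartotherian alert at the very end: check_wmf_service tries to decode the response body and trips over the PNG signature byte (0x89). A plain check_http probe that only asserts the status line would sidestep that; the flags below are standard monitoring-plugins check_http, though whether this is what actually replaced the removed check is not shown in this log:

    # Fetch the cheap water tile over TLS and require an HTTP 200,
    # without attempting to parse the PNG payload:
    /usr/lib/nagios/plugins/check_http -S -H maps.wikimedia.org \
        -u /osm-intl/6/23/24.png -e 'HTTP/1.1 200'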