[00:02:24] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[00:04:04] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10514 bytes in 0.113 second response time
[00:04:15] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:24:24] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3032_v6, cp3043_v6
[00:26:06] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK
[00:30:19] operations, Availability: Figure out a replication strategy for Swift - https://phabricator.wikimedia.org/T91869#1657302 (aaron) Related bug: https://phabricator.wikimedia.org/T112708
[00:35:48] (PS1) Tim Landscheidt: Tools: Replace reference to tools. in toolschecker.upstart [puppet] - https://gerrit.wikimedia.org/r/239762 (https://phabricator.wikimedia.org/T87387)
[00:37:19] (CR) Tim Landscheidt: "Tested on Toolsbeta:" [puppet] - https://gerrit.wikimedia.org/r/239762 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[00:40:05] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3048_v6
[00:43:25] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[00:51:15] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3039_v6
[00:52:24] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3034_v6
[00:54:05] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK
[00:54:45] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[00:55:25] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2003_v6
[00:57:15] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[01:00:05] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2005_v6
[01:03:04] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2011_v6
[01:04:45] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK
[01:07:06] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[01:16:56] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4005_v6
[01:21:26] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:22:35] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3042_v6
[01:24:16] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK
[01:27:25] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK
[01:28:34] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:31:15] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3017_v6, cp4012_v6
[01:32:06] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 505 bytes in 0.998 second response time
[01:33:05] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[01:38:16] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2009_v6
[01:41:46] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[01:42:54] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:43:54] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4006_v6
[01:45:54] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3034_v6
[01:46:15] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[01:46:25] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 505 bytes in 1.008 second response time
[01:46:33] !log downtimed the "LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6" alert for now ( https://phabricator.wikimedia.org/T113154 )
[01:46:35] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL -
ok: 22 not-conn: cp4012_v6, cp4020_v6
[01:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:47:25] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK
[01:47:44] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[01:50:06] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[01:51:26] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6
[01:52:34] PROBLEM - LVS HTTPS IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[01:53:16] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK
[01:54:15] RECOVERY - LVS HTTPS IPv6 on upload-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 748 bytes in 1.041 second response time
[01:54:37] operations, Traffic: LVS HTTPS IPv6 on mobile-lb.eqiad alert occasionally flapping - https://phabricator.wikimedia.org/T113154#1657319 (BBlack) This has always hit text and upload lb's (again, IPv6, in eqiad) as well, but usually they're less-likely than mobile to reach 3/3 and actually send an alert. I'...
[01:56:16] !log downtimed eqiad ipv6 text/upload alerts as well, as with mobile above ( 1 301 TLS Redirect - 505 bytes in 1.008 second response time
[01:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:56:31] stupid paste, but it works :P
[01:59:44] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2008_v6
[02:01:34] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK
[02:06:41] !log Maps: created indexes on admin. <3 Postgres :(
[02:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:19:36] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2017_v6
[02:19:59] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 25s)
[02:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:21:26] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[02:23:12] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-21 02:23:12+00:00
[02:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:25] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2009_v6
[02:25:06] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK
[02:31:05] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2021_v6
[02:32:54] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[02:33:45] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[02:35:05] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp2009_v6, cp4011_v6, cp4012_v6, cp4020_v6
[02:36:26] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 57 connecting: (unnamed) not-conn: cp3034_v6, cp3044_v6, cp3045_v6
[02:36:54] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[02:37:35] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 21 connecting: (unnamed) not-conn: cp2009_v6, cp3018_v6, cp4011_v6
[02:38:15] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[02:39:07] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[02:39:15] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK
[02:39:45] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp3015_v6, cp4011_v6
[02:39:54] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan
CRITICAL - ok: 23 not-conn: cp4019_v6
[02:41:26] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[02:41:35] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[02:45:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0]
[02:47:35] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3044_v6
[02:48:05] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3017_v6
[02:49:24] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK
[02:49:54] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK
[02:52:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:05:14] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:35:04] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:39:26] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [100000000.0]
[03:49:38] (PS3) Tim Starling: Adding comment on disabling anon page creation on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/237960 (owner: Kaldari)
[03:49:44] (CR) Tim Starling: [C: 2] Adding comment on disabling anon page creation on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/237960 (owner: Kaldari)
[03:49:51] (Merged) jenkins-bot: Adding comment on disabling anon page creation on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/237960 (owner: Kaldari)
[03:50:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[03:59:05] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[04:04:06] RECOVERY - Outgoing network
saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:04:46] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:12:54] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0]
[04:31:03] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 21 04:31:03 UTC 2015 (duration 31m 2s)
[04:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:37:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[04:41:05] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:57:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[05:02:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[05:14:45] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[05:58:36] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 510932 msg: ocg_render_job_queue 3335 msg (=3000 critical)
[05:59:35] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 512291 msg: ocg_render_job_queue 3966 msg (=3000 critical)
[05:59:36] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 512310 msg: ocg_render_job_queue 3977 msg (=3000 critical)
[06:08:25] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 514961 msg: ocg_render_job_queue 82 msg
[06:08:25] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 514961 msg: ocg_render_job_queue 69 msg
[06:09:16] RECOVERY - OCG health
on ocg1002 is OK: OK: ocg_job_status 515035 msg: ocg_render_job_queue 0 msg
[06:30:44] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:05] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:16] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:45] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:44] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:14] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:01] !log depooled mw1230-mw1235 (for T104968)
[07:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:13:41] (PS6) Addshore: Rsync api log archives from fluorine to stat1002 [puppet] - https://gerrit.wikimedia.org/r/238798 (https://bugzilla.wikimedia.org/112744)
[07:23:21] (PS1) Muehlenhoff: Enable ferm on mw1230-mw1235 [puppet] - https://gerrit.wikimedia.org/r/239773
[07:25:44] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail
[07:38:40] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on mw1230-mw1235 [puppet] - https://gerrit.wikimedia.org/r/239773 (owner: Muehlenhoff)
[07:43:14] <_joe_> !log installing the new hhvm package on the canary appservers
[07:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:47:22] (CR) Hashar: [C: 1] Replace Package['git-core'] with Package['git'] [puppet] - https://gerrit.wikimedia.org/r/233853 (owner: Faidon Liambotis)
[07:48:27] (CR) Alexandros Kosiaris: "Actually, it does not seem like that is the case.
See the reasoning in https://gerrit.wikimedia.org/r/#/c/239344/ where the commit reverti" [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[07:49:03] (Abandoned) Alexandros Kosiaris: Maps: Add geo-index to the water_polygons table [puppet] - https://gerrit.wikimedia.org/r/232728 (https://phabricator.wikimedia.org/T109710) (owner: Yurik)
[07:49:44] !log repooled mw1230-mw1235 (for T104968)
[07:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:52:29] (PS1) Muehlenhoff: Enable ferm on mw1221-mw1229 [puppet] - https://gerrit.wikimedia.org/r/239774
[07:53:07] !log depooled mw1221-mw1229 (for T104968)
[07:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:53:46] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:54:20] operations, OTRS: move OTRS to a VM - https://phabricator.wikimedia.org/T105554#1657477 (akosiaris)
[07:54:23] operations, OTRS, vm-requests: EQIAD: 1 VM request for OTRS - https://phabricator.wikimedia.org/T111532#1657475 (akosiaris) Open>Resolved Nothing indeed. Resolving, thanks!
[07:54:34] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: puppet fail
[07:55:51] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on mw1221-mw1229 [puppet] - https://gerrit.wikimedia.org/r/239774 (owner: Muehlenhoff)
[07:55:59] \o/
[07:59:29] operations, Mail: Replace primary mail relays (polonium/lead) - https://phabricator.wikimedia.org/T113211#1657478 (faidon) NEW a:faidon
[07:59:51] operations, Mail: Replace primary mail relays (polonium/lead) - https://phabricator.wikimedia.org/T113211#1657487 (faidon)
[07:59:53] operations, Mail: Protect incoming emails with SMTP STARTLS - https://phabricator.wikimedia.org/T101452#1657486 (faidon)
[08:04:36] !log repooled mw1221-mw1229 (for T104968)
[08:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:07:40] <_joe_> !log installing the new HHVM package on the api canaries
[08:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:13:06] (PS1) Muehlenhoff: Enable ferm for mw1190 - mw1199 [puppet] - https://gerrit.wikimedia.org/r/239775
[08:13:08] (PS1) Muehlenhoff: Enable ferm for mw1189 and mw1200 - mw1208 [puppet] - https://gerrit.wikimedia.org/r/239776
[08:14:57] PROBLEM - Host pybal-test2001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:22] that's me
[08:15:28] well, juniper's silliness, but still me
[08:15:58] RECOVERY - Host pybal-test2001 is UP: PING OK - Packet loss = 0%, RTA = 35.91 ms
[08:16:08] PROBLEM - logstash process on logstash1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[08:16:21] logstash dying is not me, though
[08:16:55] <_joe_> I'll take a look
[08:18:03] !log depooled mw1190-mw1195 and mw1197-mw1199 (for T104968)
[08:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:20:21] (PS1) Faidon Liambotis: Add forward/reverse for mx1001/mx2001 [dns] -
https://gerrit.wikimedia.org/r/239778
[08:20:29] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm for mw1190 - mw1199 [puppet] - https://gerrit.wikimedia.org/r/239775 (owner: Muehlenhoff)
[08:20:58] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 6 failures
[08:21:11] (CR) Faidon Liambotis: [C: 2] Add forward/reverse for mx1001/mx2001 [dns] - https://gerrit.wikimedia.org/r/239778 (owner: Faidon Liambotis)
[08:21:17] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: puppet fail
[08:21:27] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[08:21:30] <_joe_> !log restarted the logstash agent on logstash1003, OOM'd
[08:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:21:47] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[08:23:36] (PS2) Filippo Giunchedi: restbase: add cassandra password for test cluster [puppet] - https://gerrit.wikimedia.org/r/239398 (https://phabricator.wikimedia.org/T92590)
[08:23:43] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: add cassandra password for test cluster [puppet] - https://gerrit.wikimedia.org/r/239398 (https://phabricator.wikimedia.org/T92590) (owner: Filippo Giunchedi)
[08:29:07] !log repooled mw1190-mw1195 and mw1197-mw1199 (for T104968)
[08:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:29:51] !log switch to 'restbase' cassandra user on restbase test cluster
[08:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:33:06] !log depooled mw1189 and mw1200-mw1208 (for T104968)
[08:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:39:09] (PS2) Muehlenhoff: Enable ferm for mw1189 and mw1200 - mw1208 [puppet] -
https://gerrit.wikimedia.org/r/239776
[08:39:43] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm for mw1189 and mw1200 - mw1208 [puppet] - https://gerrit.wikimedia.org/r/239776 (owner: Muehlenhoff)
[08:45:16] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:46:45] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:48:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds
[08:48:12] !log repooled mw1189 and mw1200-mw1208 (for T104968)
[08:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:48:22] (CR) Zfilipin: WIP Move Ruby related packages to a separate file (2 comments) [puppet] - https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: Zfilipin)
[08:50:35] (CR) Phuedx: [C: 1] Replicate browser test config for QuickSurveys [mediawiki-config] - https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: Jdlrobson)
[08:51:13] (CR) Zfilipin: "To make things simpler, we should move all jobs from ubuntu precise to trusty, that will reduce the number of platforms from 3 to 2.
The t" [puppet] - https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: Zfilipin)
[08:51:58] (CR) Hashar: WIP Move Ruby related packages to a separate file (5 comments) [puppet] - https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: Zfilipin)
[08:57:06] (PS1) Faidon Liambotis: Add mx1001/mx2001 as role mail::mx [puppet] - https://gerrit.wikimedia.org/r/239784 (https://phabricator.wikimedia.org/T113211)
[08:57:41] (CR) Faidon Liambotis: [C: 2] Add mx1001/mx2001 as role mail::mx [puppet] - https://gerrit.wikimedia.org/r/239784 (https://phabricator.wikimedia.org/T113211) (owner: Faidon Liambotis)
[08:58:06] operations, HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1657538 (Joe) a:Joe
[09:00:26] (PS1) Muehlenhoff: Enable ferm on mw1120-mw1129 [puppet] - https://gerrit.wikimedia.org/r/239785
[09:00:28] (PS1) Muehlenhoff: Enable ferm on mw1130 - mw1139 [puppet] - https://gerrit.wikimedia.org/r/239786
[09:00:30] (PS1) Muehlenhoff: Enable ferm on mw1140 - mw1148 [puppet] - https://gerrit.wikimedia.org/r/239787
[09:02:10] !log depooled mw1120-mw1129 (for T104968)
[09:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:03:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[09:05:12] operations, HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1657542 (Joe) I am unsure of what is causing this, but it's happening upon installation only fairly recently (those are all ssytems installed recently).
I'll inspect puppet for it
[09:06:15] (PS2) Muehlenhoff: Enable ferm on mw1120-mw1129 [puppet] - https://gerrit.wikimedia.org/r/239785
[09:06:32] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on mw1120-mw1129 [puppet] - https://gerrit.wikimedia.org/r/239785 (owner: Muehlenhoff)
[09:07:15] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[09:13:42] operations, OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1657547 (akosiaris) >>! In T105125#1655522, @Dzahn wrote: >>>! In T105125#1653951, @Krenair wrote: >> Does this block {T74109}? > > No, this task is outdated. It is now going to...
[09:14:39] !log repooled mw1120-mw1129 (for T104968)
[09:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:19:19] (PS1) Filippo Giunchedi: restbase: add codfw service ip [dns] - https://gerrit.wikimedia.org/r/239789
[09:22:47] !log depooled mw1130-mw1139 (for T104968)
[09:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:24:09] operations, HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1657561 (Joe) I suspect this happens because of the installation of some php (zend) extensions that run "php" in their installation process, and that is already been aliased to hhvm. T...
[09:25:06] (PS2) Muehlenhoff: Enable ferm on mw1130 - mw1139 [puppet] - https://gerrit.wikimedia.org/r/239786
[09:26:38] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on mw1130 - mw1139 [puppet] - https://gerrit.wikimedia.org/r/239786 (owner: Muehlenhoff)
[09:27:06] (CR) Luke081515: "Is "es" needed at the language file?
(Because wiki is locked)" [mediawiki-config] - https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: Alex Monk)
[09:36:54] !log repooled mw1130-mw1139 (for T104968)
[09:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:41:46] !log depooled mw1140 and mw1142-mw1148 (for T104968)
[09:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:44:08] (PS2) Muehlenhoff: Enable ferm on mw1140 - mw1148 [puppet] - https://gerrit.wikimedia.org/r/239787
[09:45:39] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on mw1140 - mw1148 [puppet] - https://gerrit.wikimedia.org/r/239787 (owner: Muehlenhoff)
[09:45:54] (PS1) Giuseppe Lavagetto: hhvm: explicitly declare existence and ownership of cache files [puppet] - https://gerrit.wikimedia.org/r/239792 (https://phabricator.wikimedia.org/T112517)
[09:48:20] (PS2) Giuseppe Lavagetto: hhvm: explicitly declare existence and ownership of cache files [puppet] - https://gerrit.wikimedia.org/r/239792 (https://phabricator.wikimedia.org/T112517)
[09:48:35] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "https://puppet-compiler.wmflabs.org/906/mw1158.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/239792 (https://phabricator.wikimedia.org/T112517) (owner: Giuseppe Lavagetto)
[09:53:11] (PS1) Filippo Giunchedi: cassandra: provision restbase user [puppet] - https://gerrit.wikimedia.org/r/239793 (https://phabricator.wikimedia.org/T92590)
[09:55:54] operations, Mail: Replace primary mail relays (polonium/lead) - https://phabricator.wikimedia.org/T113211#1657622 (faidon)
[09:56:10] !log repooled mw1140 and mw1142-mw1148 (for T104968)
[09:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:56:41] (PS1) Filippo Giunchedi: cassandra: enable inter-dc encryption [puppet] - https://gerrit.wikimedia.org/r/239794
(https://phabricator.wikimedia.org/T108953)
[09:57:06] (PS1) Giuseppe Lavagetto: hhvm: harden cache permissions [puppet] - https://gerrit.wikimedia.org/r/239795
[10:04:36] (PS1) Filippo Giunchedi: cassandra: expose cassandra.yaml auto_bootstrap setting [puppet] - https://gerrit.wikimedia.org/r/239797
[10:05:54] (PS1) Muehlenhoff: Enable ferm on mw1100 - mw1109 [puppet] - https://gerrit.wikimedia.org/r/239798
[10:05:56] (PS1) Muehlenhoff: Enable ferm on mw1026-mw1029 and mw1110-mw1113 [puppet] - https://gerrit.wikimedia.org/r/239799
[10:06:46] !log depooled mw1100-mw1109 (for T104968)
[10:06:49] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[10:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:07:22] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: add codfw service ip [dns] - https://gerrit.wikimedia.org/r/239789 (owner: Filippo Giunchedi)
[10:09:08] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on mw1100 - mw1109 [puppet] - https://gerrit.wikimedia.org/r/239798 (owner: Muehlenhoff)
[10:10:03] (PS1) Faidon Liambotis: mailman: enable TLS for lists.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/239800 (https://phabricator.wikimedia.org/T82576)
[10:10:10] (PS2) Filippo Giunchedi: cassandra: expose cassandra.yaml auto_bootstrap setting [puppet] - https://gerrit.wikimedia.org/r/239797
[10:10:18] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: expose cassandra.yaml auto_bootstrap setting [puppet] - https://gerrit.wikimedia.org/r/239797 (owner: Filippo Giunchedi)
[10:10:26] (PS2) Faidon Liambotis: mailman: enable TLS for lists.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/239800 (https://phabricator.wikimedia.org/T82576)
[10:10:29] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data
above and 6 below the confidence bounds
[10:10:32] (CR) Faidon Liambotis: [C: 2 V: 2] mailman: enable TLS for lists.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/239800 (https://phabricator.wikimedia.org/T82576) (owner: Faidon Liambotis)
[10:13:13] (PS2) Filippo Giunchedi: cassandra: provision restbase user [puppet] - https://gerrit.wikimedia.org/r/239793 (https://phabricator.wikimedia.org/T92590)
[10:13:20] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: provision restbase user [puppet] - https://gerrit.wikimedia.org/r/239793 (https://phabricator.wikimedia.org/T92590) (owner: Filippo Giunchedi)
[10:15:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[10:17:23] !log create restbase user on cassandra cluster
[10:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:17:32] !log repooled mw1100-mw1109 (for T104968)
[10:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:20:01] (PS2) Alexandros Kosiaris: WIP: modularize otrs [puppet] - https://gerrit.wikimedia.org/r/239369
[10:20:55] (CR) jenkins-bot: [V: -1] WIP: modularize otrs [puppet] - https://gerrit.wikimedia.org/r/239369 (owner: Alexandros Kosiaris)
[10:21:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[10:23:25] (PS1) Faidon Liambotis: sslcert: add /etc/ssl/private, set to 0711 [puppet] - https://gerrit.wikimedia.org/r/239801
[10:23:27] (PS1) Faidon Liambotis: lists: group => Debian-exim for lists' certificate [puppet] - https://gerrit.wikimedia.org/r/239802
[10:24:15] !log depooled mw1026-mw1029 and mw1110-mw1113 (for T104968)
[10:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:26:41] (PS2) Muehlenhoff: Enable ferm on mw1026-mw1029 and mw1110-mw1113 [puppet] -
https://gerrit.wikimedia.org/r/239799
[10:27:55] (CR) Alexandros Kosiaris: [C: 1] sslcert: add /etc/ssl/private, set to 0711 [puppet] - https://gerrit.wikimedia.org/r/239801 (owner: Faidon Liambotis)
[10:28:41] (PS2) Faidon Liambotis: lists: group => Debian-exim for lists' certificate [puppet] - https://gerrit.wikimedia.org/r/239802
[10:28:43] (PS2) Faidon Liambotis: sslcert: add /etc/ssl/private, set mode to 0711 [puppet] - https://gerrit.wikimedia.org/r/239801
[10:29:06] (CR) Muehlenhoff: [C: 2 V: 2] Enable ferm on mw1026-mw1029 and mw1110-mw1113 [puppet] - https://gerrit.wikimedia.org/r/239799 (owner: Muehlenhoff)
[10:29:19] (CR) Faidon Liambotis: [C: 2] sslcert: add /etc/ssl/private, set mode to 0711 [puppet] - https://gerrit.wikimedia.org/r/239801 (owner: Faidon Liambotis)
[10:29:48] (PS3) Faidon Liambotis: lists: group => Debian-exim for lists' certificate [puppet] - https://gerrit.wikimedia.org/r/239802
[10:29:50] (PS3) Faidon Liambotis: sslcert: add /etc/ssl/private, set mode to 0711 [puppet] - https://gerrit.wikimedia.org/r/239801
[10:30:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[10:35:08] (CR) Filippo Giunchedi: [C: 1] sslcert: add /etc/ssl/private, set mode to 0711 [puppet] - https://gerrit.wikimedia.org/r/239801 (owner: Faidon Liambotis)
[10:35:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds
[10:39:05] !log repooled mw1026-mw1029 and mw1110-mw1113 (for T104968)
[10:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:39:15] (CR) Faidon Liambotis: [C: 2] lists: group => Debian-exim for lists' certificate [puppet] - https://gerrit.wikimedia.org/r/239802 (owner: Faidon Liambotis)
[10:43:03] RECOVERY - HTTP error ratio anomaly
detection on graphite1001 is OK: OK: No anomaly detected
[10:43:23] operations, Salt: salt cmd.run reports an empty dictionary instead of empty string sometimes - https://phabricator.wikimedia.org/T113217#1657658 (fgiunchedi) NEW
[10:47:51] operations, Patch-For-Review: Ferm rules for app servers - https://phabricator.wikimedia.org/T104968#1657670 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[10:48:14] operations, Patch-For-Review: Ferm rules for app servers - https://phabricator.wikimedia.org/T104968#1657672 (MoritzMuehlenhoff) Open>Resolved All mediawiki application servers and API servers are now using ferm.
[10:55:19] <_joe_> \o/
[10:55:52] operations, HHVM, Patch-For-Review: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1657696 (Joe) Open>Resolved
[10:57:01] (CR) Gilles: [C: 1] asset-check: Use mwLoadEvent hook instead of polling modules directly [puppet] - https://gerrit.wikimedia.org/r/235956 (owner: Krinkle)
[10:57:04] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100%
[10:59:20] operations, Analytics-Kanban, hardware-requests, Patch-For-Review: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1657727 (akosiaris) > Alex, proceed with VLAN changes! Then we can reinstall. Done. All 3 boxes changed in VLAN and interface descriptions
[11:00:13] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100%
[11:00:55] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:22] <_joe_> uhm someone knows what this ^^ is about?
[11:01:47] IIRC these are commissioned?
let me check [11:03:10] yeah, these were once in site.pp [11:03:15] <_joe_> thanks, I'm looking at something else [11:03:48] (decommissioned obviously) [11:03:53] 6operations, 7HHVM, 7Wikimedia-log-errors: Unknown modifier '\': [([^\s,]+)\s*=\s*([^\s,]+)[\+\-]] - https://phabricator.wikimedia.org/T112922#1657734 (10Joe) [11:14:56] (03PS1) 10Muehlenhoff: Enable ferm on mw1154/mw1155 [puppet] - 10https://gerrit.wikimedia.org/r/239806 [11:14:58] (03PS1) 10Muehlenhoff: Enable ferm on mw1156, mw1157 [puppet] - 10https://gerrit.wikimedia.org/r/239807 [11:15:00] (03PS1) 10Muehlenhoff: Enable ferm on mw1158, mw1159 [puppet] - 10https://gerrit.wikimedia.org/r/239808 [11:15:02] (03PS1) 10Muehlenhoff: Enable ferm on mw1160 [puppet] - 10https://gerrit.wikimedia.org/r/239809 [11:21:06] !log depooled mw1154, mw1155 (for T104969) [11:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:22:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1154/mw1155 [puppet] - 10https://gerrit.wikimedia.org/r/239806 (owner: 10Muehlenhoff) [11:26:03] 6operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Varnish not letting through RESTBase back-end service responses for rest.wm.org - https://phabricator.wikimedia.org/T113223#1657757 (10mobrovac) 3NEW [11:26:12] !log repooled mw1154, mw1155 (for T104969) [11:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:45] 6operations, 7HHVM, 7Wikimedia-log-errors: Unknown modifier '\': [([^\s,]+)\s*=\s*([^\s,]+)[\+\-]] - https://phabricator.wikimedia.org/T112922#1657766 (10Joe) So, this happens because someone somewhere has something like this: ``` preg_match('([^\s,]+)\s*=\s*([^\s,]+)[\+\-]', 's = 4+'); ``` and causes preg... 
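Editor's note on the T112922 comment above: PHP's preg functions treat the first character of the pattern string as a delimiter, and a leading `(` is closed by its matching `)`. Everything after that is read as modifiers, which would explain the "Unknown modifier '\'" message. The following is a minimal Python sketch of that splitting rule, not PHP's actual implementation; `split_php_pattern` is a hypothetical helper, and the `/`-wrapped variant is only an illustration, since the log does not show the fix that was applied.

```python
# Bracket-style delimiters close with their counterpart; any other
# delimiter closes with itself.
BRACKETS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def split_php_pattern(pattern):
    """Split a PHP-style regex string into (body, modifiers).

    Simplified model: the first character is the delimiter, backslash
    escapes are skipped, and bracket delimiters may nest.
    """
    if not pattern:
        raise ValueError("empty pattern")
    start = pattern[0]
    if start.isalnum() or start == "\\" or start.isspace():
        raise ValueError("delimiter must not be alphanumeric, backslash, or whitespace")
    end = BRACKETS.get(start, start)
    depth = 1
    i = 1
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == end:
            depth -= 1
            if depth == 0:
                return pattern[1:i], pattern[i + 1:]
        elif ch == start:
            depth += 1  # nested opening bracket
        i += 1
    raise ValueError("no ending delimiter found")


raw = r"([^\s,]+)\s*=\s*([^\s,]+)[\+\-]"

# Without explicit delimiters, the first group is taken as the whole pattern.
body, modifiers = split_php_pattern(raw)
print(repr(body), repr(modifiers[:1]))

# With explicit delimiters, the full expression becomes the body.
body_ok, modifiers_ok = split_php_pattern("/" + raw + "/")
print(body_ok == raw, repr(modifiers_ok))
```

In this model the undelimited string yields a body of `[^\s,]+` and a modifier string that starts with a backslash, matching the character named in the error.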
[11:30:02] 6operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Varnish not letting through RESTBase back-end service responses for rest.wm.org - https://phabricator.wikimedia.org/T113223#1657775 (10mobrovac) [11:31:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1156, mw1157 [puppet] - 10https://gerrit.wikimedia.org/r/239807 (owner: 10Muehlenhoff) [11:37:35] !log depooled and repooled mw1156, mw1157 (for T104969) [11:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:39:26] !log depooled mw1158, mw1159 (for T104969) [11:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:42:36] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1158, mw1159 [puppet] - 10https://gerrit.wikimedia.org/r/239808 (owner: 10Muehlenhoff) [11:42:48] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Enable STARTTLS (both inbound and outbound) on lists - https://phabricator.wikimedia.org/T82576#1657798 (10faidon) p:5Normal>3High a:3faidon [11:42:59] 6operations, 7HHVM, 7Wikimedia-log-errors: Unknown modifier '\': [([^\s,]+)\s*=\s*([^\s,]+)[\+\-]] - https://phabricator.wikimedia.org/T112922#1657800 (10Joe) So, the error is there since September 13th, the date when https://gerrit.wikimedia.org/r/#/c/238073 was merged. It's not clear to me how the statis... [11:43:18] 6operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1657804 (10faidon) [11:43:20] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Enable STARTTLS (both inbound and outbound) on lists - https://phabricator.wikimedia.org/T82576#1657802 (10faidon) 5Open>3Resolved This should be done now. 
[11:47:19] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Enable STARTTLS (both inbound and outbound) on lists - https://phabricator.wikimedia.org/T82576#1657812 (10faidon) [11:49:40] (03PS1) 10Faidon Liambotis: exim: enable outbound TLS for primary MX [puppet] - 10https://gerrit.wikimedia.org/r/239815 [11:50:22] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim: enable outbound TLS for primary MX [puppet] - 10https://gerrit.wikimedia.org/r/239815 (owner: 10Faidon Liambotis) [11:51:23] !log repooled mw1158, mw1159 (for T104969) [11:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:18] (03PS3) 10Alexandros Kosiaris: WIP: modularize otrs [puppet] - 10https://gerrit.wikimedia.org/r/239369 [11:54:36] !log depooled mw1160 (for T104969) [11:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:57] (03PS2) 10Muehlenhoff: Enable ferm on mw1160 [puppet] - 10https://gerrit.wikimedia.org/r/239809 [11:55:15] (03CR) 10jenkins-bot: [V: 04-1] WIP: modularize otrs [puppet] - 10https://gerrit.wikimedia.org/r/239369 (owner: 10Alexandros Kosiaris) [11:58:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1160 [puppet] - 10https://gerrit.wikimedia.org/r/239809 (owner: 10Muehlenhoff) [12:01:15] !log repooled mw1160 (for T104969) [12:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:03:04] (03Abandoned) 10Muehlenhoff: Enable ferm on mw1221-mw1229 [puppet] - 10https://gerrit.wikimedia.org/r/239041 (owner: 10Muehlenhoff) [12:03:23] (03Abandoned) 10Muehlenhoff: Enable ferm on mw1230-mw1235 [puppet] - 10https://gerrit.wikimedia.org/r/239042 (owner: 10Muehlenhoff) [12:04:58] (03CR) 10Faidon Liambotis: [C: 04-1] "Let's not create per-host hieradata for this. This should be under the hiera data for role bastion." 
[puppet] - 10https://gerrit.wikimedia.org/r/239023 (owner: 10Dzahn) [12:05:06] 6operations: Ferm rules for image scalers - https://phabricator.wikimedia.org/T104969#1657853 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [12:05:35] 6operations: Ferm rules for image scalers - https://phabricator.wikimedia.org/T104969#1657855 (10MoritzMuehlenhoff) 5Open>3Resolved All image scalers are now using ferm [12:05:50] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: shutdown sodium, decom - https://phabricator.wikimedia.org/T110142#1657858 (10faidon) Decommission it means more than just merging that changeset; we have a process for that which also includes revocation of certificates, wiping the server etc. [12:06:16] moritzm: [12:06:20] moritzm: \o/ [12:06:44] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Decommission sodium - https://phabricator.wikimedia.org/T110142#1657859 (10faidon) [12:12:44] 6operations, 10Salt: salt cmd.run reports an empty dictionary instead of empty string sometimes - https://phabricator.wikimedia.org/T113217#1657866 (10fgiunchedi) interestingly enough, another run of the same command didn't yield any `{}` but only the expected `''` as results [12:19:18] (03PS1) 10Giuseppe Lavagetto: Fix regex in stats code [debs/hhvm] - 10https://gerrit.wikimedia.org/r/239816 (https://phabricator.wikimedia.org/T112922) [12:19:43] 6operations, 7HHVM, 5Patch-For-Review, 7Wikimedia-log-errors: Unknown modifier '\': [([^\s,]+)\s*=\s*([^\s,]+)[\+\-]] - https://phabricator.wikimedia.org/T112922#1657874 (10Joe) Digging thorugh the HHVM codebase I found where the problem was originated; this is only affecting statistics collection anyways,... 
[12:19:55] 6operations, 7HHVM, 5Patch-For-Review, 7Wikimedia-log-errors: Unknown modifier '\': [([^\s,]+)\s*=\s*([^\s,]+)[\+\-]] - https://phabricator.wikimedia.org/T112922#1657875 (10Joe) p:5Normal>3Low [12:21:58] (03PS4) 10Alexandros Kosiaris: WIP: modularize otrs [puppet] - 10https://gerrit.wikimedia.org/r/239369 [12:23:10] (03CR) 10jenkins-bot: [V: 04-1] WIP: modularize otrs [puppet] - 10https://gerrit.wikimedia.org/r/239369 (owner: 10Alexandros Kosiaris) [12:28:53] (03PS5) 10Alexandros Kosiaris: WIP: modularize otrs [puppet] - 10https://gerrit.wikimedia.org/r/239369 [12:29:32] 6operations, 10Salt: salt cmd.run reports an empty dictionary instead of empty string sometimes - https://phabricator.wikimedia.org/T113217#1657886 (10MoritzMuehlenhoff) Hosts which are genereally unreachable (powered down or similar) also return a dict for failing queries. This might be related, e.g. that con... [12:35:51] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1657896 (10Selsharbaty-WMF) Hi colleagues, I have asked the Collab members if they have any problems with keeping the old archives open to the public and no one had conce... [12:37:39] (03CR) 10Faidon Liambotis: [C: 031] Remove sodium from puppet (spare/decom) [puppet] - 10https://gerrit.wikimedia.org/r/239411 (https://phabricator.wikimedia.org/T110142) (owner: 10John F. Lewis) [12:45:00] (03CR) 10Alex Monk: "Yes. 
Interwikis and SiteMatrix entries should not go missing due to a wiki being locked - instead, SiteMatrix detects this separately and " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [12:47:58] (03Abandoned) 10ArielGlenn: Dump GeoData information [dumps] - 10https://gerrit.wikimedia.org/r/155080 (https://bugzilla.wikimedia.org/51225) (owner: 10MaxSem) [12:49:32] (03PS2) 10Filippo Giunchedi: cassandra: enable inter-dc encryption [puppet] - 10https://gerrit.wikimedia.org/r/239794 (https://phabricator.wikimedia.org/T108953) [12:49:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: enable inter-dc encryption [puppet] - 10https://gerrit.wikimedia.org/r/239794 (https://phabricator.wikimedia.org/T108953) (owner: 10Filippo Giunchedi) [12:52:19] (03CR) 10Alex Monk: "But then won't iron need some sort of exception?" [puppet] - 10https://gerrit.wikimedia.org/r/239023 (owner: 10Dzahn) [12:53:09] !log rolling restart cassandra after enabling dc encryption, no nodes in codfw yet [12:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:28] 6operations, 10ops-eqiad: mw1031 has a bad uplink - https://phabricator.wikimedia.org/T95896#1657939 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson resolving this and created https://phabricator.wikimedia.org/T113079 to decommission mw1031 [13:07:19] 6operations, 7Mail: Replace primary mail relays (polonium/lead) - https://phabricator.wikimedia.org/T113211#1657946 (10faidon) The new hosts, mx1001/mx2001 are up and running. I've already notified WMF's Office IT team to update Google Apps with the new IPs. 
[13:07:34] (03PS6) 10Alexandros Kosiaris: WIP: modularize otrs [puppet] - 10https://gerrit.wikimedia.org/r/239369 [13:14:33] (03PS1) 10Krinkle: varnish: Don't disable Cache-Control for all mobile traffic [puppet] - 10https://gerrit.wikimedia.org/r/239826 (https://phabricator.wikimedia.org/T113007) [13:24:33] (03PS7) 10Alexandros Kosiaris: modularize otrs [puppet] - 10https://gerrit.wikimedia.org/r/239369 [13:26:52] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] modularize otrs [puppet] - 10https://gerrit.wikimedia.org/r/239369 (owner: 10Alexandros Kosiaris) [13:28:47] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1657973 (10Ottomata) The relevant group here is `analytics-privatedata-users` [13:29:29] 10Ops-Access-Requests, 6operations, 10Wikimedia-Blog: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1657974 (10Ottomata) I believe `researchers` is the proper group. 
[13:37:56] (03CR) 10Addshore: [C: 031] Make link in dataset relative [puppet] - 10https://gerrit.wikimedia.org/r/239367 (https://phabricator.wikimedia.org/T112892) (owner: 10JanZerebecki) [13:38:05] (03PS7) 10Addshore: Rsync api log archives from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/238798 (https://bugzilla.wikimedia.org/112744) [13:42:42] (03PS1) 10Filippo Giunchedi: restbase: set cassandra credentials [puppet] - 10https://gerrit.wikimedia.org/r/239829 (https://phabricator.wikimedia.org/T92590) [13:44:25] (03CR) 10Ottomata: [C: 032] Rsync api log archives from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/238798 (https://bugzilla.wikimedia.org/112744) (owner: 10Addshore) [13:46:32] (03PS2) 10Zfilipin: rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) [13:48:12] (03PS3) 10Zfilipin: rubocop: Ignoring Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) [13:48:24] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/239829 (https://phabricator.wikimedia.org/T92590) (owner: 10Filippo Giunchedi) [13:48:26] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:49:11] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [13:49:58] (03PS1) 10Ottomata: Rename fluorine api rsync job to mw-api to avoid conflict with webrequest api log rsync job [puppet] - 10https://gerrit.wikimedia.org/r/239830 (https://phabricator.wikimedia.org/T112744) [13:51:11] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hhvm (3.6.5+dfsg1-1+wm6) urgency=medium [debs/hhvm] - 10https://gerrit.wikimedia.org/r/238408 (https://phabricator.wikimedia.org/T112640) (owner: 10Giuseppe Lavagetto) [13:57:27] (03CR) 10Ottomata: [C: 032] Rename fluorine api rsync job to 
mw-api to avoid conflict with webrequest api log rsync job [puppet] - 10https://gerrit.wikimedia.org/r/239830 (https://phabricator.wikimedia.org/T112744) (owner: 10Ottomata) [14:00:02] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:04:30] (03PS2) 10Filippo Giunchedi: restbase: set cassandra credentials [puppet] - 10https://gerrit.wikimedia.org/r/239829 (https://phabricator.wikimedia.org/T92590) [14:04:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: set cassandra credentials [puppet] - 10https://gerrit.wikimedia.org/r/239829 (https://phabricator.wikimedia.org/T92590) (owner: 10Filippo Giunchedi) [14:09:35] !log rolling restart restbase in production after cassandra credentials change [14:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:50] (03PS1) 10Ottomata: Fix wildcard in rsync for api.log [puppet] - 10https://gerrit.wikimedia.org/r/239840 (https://phabricator.wikimedia.org/T112744) [14:19:43] (03PS2) 10Ottomata: Fix identation and wildcard in rsync for api.log [puppet] - 10https://gerrit.wikimedia.org/r/239840 (https://phabricator.wikimedia.org/T112744) [14:21:36] (03CR) 10Ottomata: [C: 032] Fix identation and wildcard in rsync for api.log [puppet] - 10https://gerrit.wikimedia.org/r/239840 (https://phabricator.wikimedia.org/T112744) (owner: 10Ottomata) [14:24:44] (03PS2) 10Zfilipin: rubocop: enforcing comma after the last element of a multiline list [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) [14:25:30] (03PS2) 10Ottomata: Set replace=True for EventLogging MySQL consumer [puppet] - 10https://gerrit.wikimedia.org/r/237688 (https://phabricator.wikimedia.org/T112265) [14:26:10] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [14:27:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 
032] Pin mock<1.1.0 and add tox entry point [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233360 (owner: 10Hashar) [14:28:40] (03CR) 10Ottomata: [C: 032] Set replace=True for EventLogging MySQL consumer [puppet] - 10https://gerrit.wikimedia.org/r/237688 (https://phabricator.wikimedia.org/T112265) (owner: 10Ottomata) [14:31:15] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] pass flake8 and add entry point [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233361 (owner: 10Hashar) [14:32:28] (03CR) 10Zfilipin: "It is really strange how RuboCop "fixes" this offense with different settings. This patch set sets `EnforcedStyleForMultiline` to `consist" [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [14:33:15] !log restart eventlogging with mysql consumer replace=True (AKA INSERT IGNORE) [14:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:11] 6operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Varnish not letting through RESTBase back-end service responses for rest.wm.org - https://phabricator.wikimedia.org/T113223#1658134 (10GWicke) Does this still matter? 
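Editor's note on the `replace=True (AKA INSERT IGNORE)` entry above: the log equates the setting with INSERT IGNORE, meaning a row whose key already exists is skipped instead of raising an error. How the EventLogging consumer implements this is not shown in the log. The sketch below only demonstrates the semantic, using SQLite's `INSERT OR IGNORE` as a stand-in for MySQL; the `events` table, its columns, and `store_events` are hypothetical.

```python
import sqlite3


def make_table(conn):
    # Hypothetical schema: one row per event, keyed by a unique id.
    conn.execute("CREATE TABLE events (uuid TEXT PRIMARY KEY, payload TEXT)")


def store_events(conn, rows):
    """Insert rows, silently skipping any whose uuid is already present.

    Returns the total number of rows in the table afterwards.
    """
    conn.executemany(
        "INSERT OR IGNORE INTO events (uuid, payload) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


conn = sqlite3.connect(":memory:")
make_table(conn)
store_events(conn, [("a", "first"), ("b", "second")])
# Replaying "a" (for example after a consumer restart) leaves the stored row untouched.
store_events(conn, [("a", "replayed"), ("c", "third")])
```

The design point is that a replayed batch becomes harmless: duplicates are dropped, and the first stored copy wins.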
[14:39:00] (03PS3) 10Zfilipin: rubocop: enforcing comma after the last element of a multiline list [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) [14:41:06] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [14:41:16] (03CR) 10Hashar: "I am enabling the Jenkins jobs ( https://gerrit.wikimedia.org/r/239845 )" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/233361 (owner: 10Hashar) [14:43:38] (03CR) 10Zfilipin: "- patch set 1 is no_comma setting" [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [14:45:23] 6operations, 10ops-codfw: wipe working spare disk in codfw - https://phabricator.wikimedia.org/T112783#1658143 (10Papaul) Zhen uses 2.5 HDD and not 3.5 HDD. Right now I have 2x SATA 250 Gb 2.5 drives in the server. I will have to find one 2.5 SATA drive to perform this test. @Rob do you have a problem with me... [14:45:30] (03PS1) 10Muehlenhoff: Move base::firewall include into the roles [puppet] - 10https://gerrit.wikimedia.org/r/239847 [14:46:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [14:50:06] (03CR) 10BBlack: "Can we make the new block identical to the one in text-frontend instead of slightly-different? Is there a reason, related to different app" [puppet] - 10https://gerrit.wikimedia.org/r/239826 (https://phabricator.wikimedia.org/T113007) (owner: 10Krinkle) [14:51:23] (03CR) 10Muehlenhoff: [C: 031] Remove support for Ubuntu Lucid/10.04 [puppet] - 10https://gerrit.wikimedia.org/r/179888 (owner: 10Faidon Liambotis) [14:53:25] <_joe_> moritzm: can we really do it then? 
:) [14:56:09] (03CR) 10BBlack: [C: 031] Remove support for Ubuntu Lucid/10.04 [puppet] - 10https://gerrit.wikimedia.org/r/179888 (owner: 10Faidon Liambotis) [14:56:32] _joe_: not yet, it's part of a decom ticket by Faidon. but since I looked at some lucid-specific puppet snippet this morning, I thought I might already look over it [14:56:52] (03CR) 10BBlack: [C: 031] Replace Package['git-core'] with Package['git'] [puppet] - 10https://gerrit.wikimedia.org/r/233853 (owner: 10Faidon Liambotis) [14:57:13] <_joe_> moritzm: didn't we switch mailman on friday? [14:58:04] _joe_: yeah, all that remains is to properly decom the box: https://phabricator.wikimedia.org/T110142 [14:59:35] _joe_: Daniel is doing it today [15:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150921T1500). [15:00:05] kart_ Glaisher: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:09] Once we take a look to ensure sodium is not needed anymore for one reason or another and fermium is functioning as expected [15:00:31] here [15:01:49] here too [15:02:01] Krinkle: re: CC headers - if we limit it to only cases matching s-maxage != 0, doesn't that leave open the possibility that MW could send "s-maxage=0, max-age=3000000", skip the header overwrite, but end up cached long-term in a browser where we can't purge wiki content? [15:02:04] Who is SWAT'ng? [15:02:08] I can SWAT [15:02:37] thcipriani: my 2nd patch depends on first to pass the 'test'. Just note :) [15:02:47] bblack: If MediaWiki were to do that, then it is most likely on purpose for e.g. a SpecialPage we intend to allow caching for. [15:03:04] thcipriani: I'll recheck once you merge the first one. 
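Editor's note on bblack's `s-maxage=0, max-age=3000000` question above: under standard HTTP caching rules, `s-maxage` applies only to shared caches such as Varnish, while a browser falls back to `max-age`. The concern raised is that such a response would be uncached at the edge yet long-lived in browsers. A minimal sketch of that lookup, with hypothetical helper names and a deliberately simplified parser (no quoted commas, no other directives interpreted):

```python
def parse_cache_control(value):
    """Parse a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def freshness_lifetime(directives, shared):
    """Return the TTL in seconds that a cache would use, or None if unspecified.

    Shared caches prefer s-maxage; private (browser) caches ignore it.
    """
    if shared and directives.get("s-maxage") is not None:
        return int(directives["s-maxage"])
    if directives.get("max-age") is not None:
        return int(directives["max-age"])
    return None


header = parse_cache_control("s-maxage=0, max-age=3000000")
edge_ttl = freshness_lifetime(header, shared=True)      # what a shared cache would use
browser_ttl = freshness_lifetime(header, shared=False)  # what a browser would use
```

For the header quoted in the chat, the shared-cache lifetime is 0 and the browser lifetime is 3000000 seconds, which is the asymmetry the question points at.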
[15:03:07] kart_: I was just figuring that out :) [15:03:12] :) [15:03:13] Krinkle: ok so that's why the new mobile variant in the patch doesn't need the Special:Banner exclusion? [15:03:43] bblack: Well, that's ommitted 1) because this patch is not about that, 2) afaik that banner code no longer exists, but I'm not sure [15:03:49] Glaisher: while we wait on rechecks, let's get your patch done. [15:03:59] ok [15:05:16] bblack: I'd like to separate the minimal fix for this currently seriously impactful performance problem from making them match and introducing potentilaly incompatible behaviour [15:05:39] I simply can't account for whether or not non-smaxage or Special pages may or may not have special stuff in MObileFrontend [15:05:55] there's all kinds of stuff going on in MF that they can figure out another time [15:06:53] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239403 (https://phabricator.wikimedia.org/T104251) (owner: 10Glaisher) [15:07:20] (03Merged) 10jenkins-bot: Remove redundant entries from robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239403 (https://phabricator.wikimedia.org/T104251) (owner: 10Glaisher) [15:08:06] Krinkle: ok [15:08:21] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 534914 msg: ocg_render_job_queue 3235 msg (=3000 critical) [15:09:01] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 535958 msg: ocg_render_job_queue 3672 msg (=3000 critical) [15:09:03] (03CR) 10BBlack: [C: 031] varnish: Don't disable Cache-Control for all mobile traffic [puppet] - 10https://gerrit.wikimedia.org/r/239826 (https://phabricator.wikimedia.org/T113007) (owner: 10Krinkle) [15:09:11] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 536176 msg: ocg_render_job_queue 3756 msg (=3000 critical) [15:10:23] <_joe_> uhm [15:10:26] !log thcipriani@tin Synchronized robots.txt: SWAT: Remove redundant entries from robots.txt [[gerrit:239403]] 
(duration: 00m 12s) [15:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:40] ^ Glaisher sync'd! [15:11:12] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:11:21] (03CR) 10Filippo Giunchedi: [C: 04-1] "things prefixed with ~ are ignorable/not blockers, not sure about 'clusters' tho" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/239511 (owner: 10Rush) [15:11:57] thcipriani: thanks [15:12:03] anoyone (krenair?) know when https://gerrit.wikimedia.org/r/#/c/237169/ will be merged and executed? [15:13:03] thcipriani: we're good to go. [15:13:12] test passed. [15:13:17] kart_: kk [15:13:17] aharoni: ^ [15:14:03] kart_: the test patch SWAT-deployed, right? [15:14:14] and the revert is going to be deployed now? [15:14:21] yes [15:17:46] 6operations, 5Patch-For-Review, 7Swift: swift capacity planning - https://phabricator.wikimedia.org/T1268#1658205 (10fgiunchedi) [15:22:37] !log thcipriani@tin Synchronized php-1.26wmf23/extensions/ContentTranslation/modules/entrypoint/ext.cx.interlanguagelink.js: SWAT: Revert "Do not call cxserver to display gray interwiki link" [[gerrit:239819]] (duration: 00m 11s) [15:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:46] ^ kart_ check please [15:25:01] Sure [15:25:15] (03PS1) 10Mdann52: allow sysops to grant bot userright on https://mai.wikimedia.org Bug:T111898 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239854 (https://phabricator.wikimedia.org/T111898) [15:26:05] aharoni: can you test too? [15:26:41] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 548482 msg: ocg_render_job_queue 296 msg [15:26:48] kart_: I still see en-us [15:27:15] thcipriani: both patches are in? 
[15:27:22] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 548611 msg: ocg_render_job_queue 0 msg [15:27:31] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 548642 msg: ocg_render_job_queue 0 msg [15:27:56] aharoni: refresh? [15:28:01] kart_: yeah, both patches merged, didn't sync the test out. [15:28:13] Okay! [15:28:23] aharoni: I can't see en-US now. [15:28:44] *do't [15:29:35] kart_: yes, seems good [15:29:37] thanks you [15:30:44] cool. [15:30:49] Thanks thcipriani and aharoni [15:31:11] kart_: aharoni awesome. Thanks for checking! [15:37:06] (03CR) 10BBlack: [C: 031] Geolocate our networks to their respective DC [dns] - 10https://gerrit.wikimedia.org/r/239069 (owner: 10Faidon Liambotis) [15:38:54] (03PS2) 10Glaisher: Allow sysops to add users to bot group on mai.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239854 (https://phabricator.wikimedia.org/T111898) (owner: 10Mdann52) [15:39:14] (03CR) 10Glaisher: "Please follow https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239854 (https://phabricator.wikimedia.org/T111898) (owner: 10Mdann52) [15:40:43] 6operations, 6Analytics-Kanban, 10hardware-requests, 5Patch-For-Review: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1658357 (10kevinator) 5Open>3Resolved [15:41:14] godog: gah! in foreground on cli in gdb reqstats is still running! 
[15:41:17] GRR [15:41:26] must be because of the way it is run in diamond...:( [15:41:36] (03PS1) 10Mdann52: ns name instead of number [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239856 [15:42:02] James_F: https://logstash.wikimedia.org/#/dashboard/elasticsearch/apache2log [15:42:36] (03Abandoned) 10Mdann52: ns name instead of number [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239856 (owner: 10Mdann52) [15:45:14] (03PS3) 10Faidon Liambotis: Switch Middle-East's backup from ulsfo to eqiad [dns] - 10https://gerrit.wikimedia.org/r/239071 [15:45:16] (03PS3) 10Faidon Liambotis: Add codfw everywhere on the map [dns] - 10https://gerrit.wikimedia.org/r/239070 [15:45:18] (03PS3) 10Faidon Liambotis: Geolocate our networks to their respective DC [dns] - 10https://gerrit.wikimedia.org/r/239069 [15:45:20] (03PS3) 10Faidon Liambotis: Switch Central/South Asia to esams [dns] - 10https://gerrit.wikimedia.org/r/239072 [15:45:22] (03PS2) 10Faidon Liambotis: Move codfw to be second in place at the DC list [dns] - 10https://gerrit.wikimedia.org/r/239114 [15:45:32] (03CR) 10Ottomata: "I wonder if instead of using role/analytics/{restbase,cassandra}.pp for this, we should call this role/analytics/aqs.pp and put the classe" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [15:47:20] (03PS1) 10Mdann52: name instead of number [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239857 [15:47:22] (03CR) 10BBlack: [C: 031] Add codfw everywhere on the map [dns] - 10https://gerrit.wikimedia.org/r/239070 (owner: 10Faidon Liambotis) [15:47:53] (03Abandoned) 10Mdann52: name instead of number [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239857 (owner: 10Mdann52) [15:50:34] (03CR) 10BBlack: "If we're going to leave RU as [esams, ulsfo, ... ] in general, probably it should be [esams, ulsfo, eqiad, codfw]? 
As it stands in the pa" [dns] - 10https://gerrit.wikimedia.org/r/239071 (owner: 10Faidon Liambotis) [15:54:04] (03CR) 10BBlack: [C: 031] "Really the RU thing, if we want to address it at all, could be a separate patch at the end of this series." [dns] - 10https://gerrit.wikimedia.org/r/239071 (owner: 10Faidon Liambotis) [15:54:29] (03CR) 10BBlack: [C: 031] Switch Central/South Asia to esams [dns] - 10https://gerrit.wikimedia.org/r/239072 (owner: 10Faidon Liambotis) [15:54:45] (03PS1) 10Mdann52: use names, not numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239859 [15:55:25] (03Abandoned) 10Mdann52: use names, not numbers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239859 (owner: 10Mdann52) [15:57:40] akosiaris, so what's the scoop with the osm db? I think your patch killed our database last week [15:58:18] akosiaris, btw, just fyi - there are over 50 news sources that picked up maps announcement in russia alone [15:58:38] some claim that we will provide video, including CCTV, via it ;) [15:59:07] yurik: and unicorns [15:59:29] yurik: in a meeting in 1 min, I 'll reach you in about an hour [15:59:38] nod [16:11:48] 6operations, 10Analytics-Cluster, 6Analytics-Kanban: Fix llama user id {hawk} - https://phabricator.wikimedia.org/T100678#1659546 (10kevinator) [16:14:38] 6operations, 10Analytics-Cluster, 6Analytics-Kanban: Fix llama user id {hawk} [5 pts] - https://phabricator.wikimedia.org/T100678#1659563 (10Ottomata) [16:14:40] 6operations, 10Analytics-Cluster, 6Analytics-Kanban: Fix llama user id {hawk} [5 pts] - https://phabricator.wikimedia.org/T100678#1659564 (10kevinator) p:5Triage>3Normal [16:16:03] (03PS6) 10Mdann52: noindex user namespace on en.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) [16:17:03] (03CR) 10Mdann52: "I've added the namespace name instead of the number, as per the earlier comment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) (owner: 10Mdann52) [16:21:47] 6operations, 10Wikimedia-General-or-Unknown, 6Wikisource: Upgrade Ghostscript to 9.15 or later - https://phabricator.wikimedia.org/T110849#1659608 (10Jdforrester-WMF) [16:23:50] 10Ops-Access-Requests, 6operations: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1659695 (10Dzahn) Does this require running any commands with sudo? [16:24:53] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:43] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 7.556 second response time [16:28:55] RECOVERY - Disk space on labstore1002 is OK: DISK OK [16:39:18] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1659805 (10Dzahn) a:3coren @coren handing over because i see on Etherpad you are on-duty this week. ok? [16:39:56] 10Ops-Access-Requests, 6operations, 10Wikimedia-Blog: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1659809 (10Dzahn) a:5Ottomata>3coren [16:39:59] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 (Hue / Hive) for bmansurov - https://phabricator.wikimedia.org/T113069#1659811 (10coren) @dzahn: You are relieved. 
:-) [16:40:49] (03PS2) 10Mattflaschen: Set $wgFlowMigrateReferenceWiki to false in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238111 (https://phabricator.wikimedia.org/T107204) (owner: 10Catrope) [16:53:48] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1659880 (10RobH) Update from meeting: This was approved by ops in the operations meeting. (Discuss... [16:56:34] (03PS5) 10Yuvipanda: admin: Allow aklapper to reset user auths and delete accounts in Phab [puppet] - 10https://gerrit.wikimedia.org/r/219151 (https://phabricator.wikimedia.org/T113124) (owner: 10Aklapper) [16:57:08] (03CR) 10Yuvipanda: "This just got approved. Needs a manual rebase, however :(" [puppet] - 10https://gerrit.wikimedia.org/r/219151 (https://phabricator.wikimedia.org/T113124) (owner: 10Aklapper) [16:57:51] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1659899 (10Dzahn) @hashar brought it up in meeting and it has been approved. @Andrew might follow-u... 
[16:58:29] (03PS6) 10Yuvipanda: admin: Allow aklapper to reset user auths and delete accounts in Phab [puppet] - 10https://gerrit.wikimedia.org/r/219151 (https://phabricator.wikimedia.org/T113124) (owner: 10Aklapper) [16:58:31] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1659901 (10Dzahn) approved in meeting (if it's less or similar to what others already have) [16:59:20] (03CR) 10Yuvipanda: [C: 032] admin: Allow aklapper to reset user auths and delete accounts in Phab [puppet] - 10https://gerrit.wikimedia.org/r/219151 (https://phabricator.wikimedia.org/T113124) (owner: 10Aklapper) [17:00:32] yurik: and I am back [17:04:08] andre__: let me know when you want to try your new powers :) [17:06:01] (03CR) 10Filippo Giunchedi: WIP: elastic: sane diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/239511 (owner: 10Rush) [17:17:04] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:23:17] <_joe_> yuvipanda: ^^ you forgot to merge [17:23:29] _joe_: yeah just did [17:24:23] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [17:24:45] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [17:30:28] Did something change about mailman SSL configuration recently? [17:31:39] marktraceur: yes, paravoid has been strengthening them up [17:32:45] marktraceur: it is because we are now on jessie with Apache 2.4 so we support newer ciphers and forward secrecy [17:32:58] (03Abandoned) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [17:33:18] yuvipanda: that's the TLS part for mail [17:33:23] well, both are TLS [17:33:27] mail or webserver or both? 
[17:33:37] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1660068 (10chasemp) We made a plan to do 1030 and 1005 tomorrow and then let thing stabilize before going further. @dcausse @ebernhardson @cmjohnson... [17:33:37] marktraceur: ^ [17:34:06] Hm. [17:34:11] marktraceur: this https://phabricator.wikimedia.org/T90351 ? [17:34:23] Must be [17:34:25] or this https://phabricator.wikimedia.org/T82576 [17:34:32] It's causing listadmin to fall on its face now [17:34:49] ? [17:35:08] http://www.freecode.com/projects/listadmin [17:35:56] ERROR: fetching https://lists.wikimedia.org/mailman/admindb/wikimedia-us-mn [17:35:59] ERROR: 500 SSL negotiation failed: -- skipping list [17:36:04] aha, ok [17:37:03] so yea, that is the first link then, the cipher list is now level "medium" [17:37:12] as opposed to "compat" before on old Apache [17:38:58] we are graded A+ though and just exclude few clients https://www.ssllabs.com/ssltest/analyze.html?d=lists.wikimedia.org&latest [17:39:10] since the last release of listadmin seems to be 2007 .. yea.. hmm [17:41:07] clients we exclude: Java6, openssl 0.9.8, IE6/IE8 on XP [17:41:10] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1660099 (10Jgreen) > What is the highest sampling rate we can afford? Significantly unknown. Load is pretty low... [17:43:26] mutante: So...no dice? Gotta do things manually? [17:44:43] 6operations, 10Wikimedia-Mailing-lists: @txt.att.net bounce notifications being sent to list admins - https://phabricator.wikimedia.org/T112912#1660112 (10JohnLewis) 5Open>3Invalid It's subscription spam again (has happened in the past). Has anyone received any recently? My last bounce was September 18th a... 
[17:44:50] marktraceur: i think it needs a fix in the script itself [17:45:02] listadmin? :/ [17:45:05] yes [17:45:06] I'd echo its client side [17:45:11] it's not been updated in years [17:45:18] perl... [17:45:27] * marktraceur looks at perl, looks at not writing perl [17:48:54] yuvipanda: heh, thanks... after Gerrit Cleanup Day I'll try [17:49:15] andre__: I just want to see if it actually works :) [17:49:23] so we can close the ticket and move on [17:49:46] I found the patch because I was triaging for Gerrit Cleanup day, :) [17:51:15] andre__: if you want to wait until after to verify, I can just revert it for now [17:52:06] marktraceur: are you on listadmin 2.3.7? apparently there is 2.4.0 but the link is a bit more hidden on his page: http://heim.ifi.uio.no/kjetilho/hacks/listadmin-2.40.tar.gz [17:52:34] 2.40, from Debian repos [17:52:49] Just ran an update too [17:53:39] hmm.. wanna try reporting it upstream? [17:54:28] akosiaris, ping? [17:55:10] I will, sure [17:55:21] ok,cool [17:55:41] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1660173 (10ellery) + 1 [17:56:17] (03PS1) 10Yuvipanda: Revert "admin: Allow aklapper to reset user auths and delete accounts in Phab" [puppet] - 10https://gerrit.wikimedia.org/r/239892 [17:56:19] (03PS2) 10Cmjohnson: remove mw1031 from dsh groups and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/239513 (https://phabricator.wikimedia.org/T113079) (owner: 10Dzahn) [17:56:26] andre__: ^^ [17:57:53] (03CR) 10Faidon Liambotis: [C: 032] Move codfw to be second in place at the DC list [dns] - 10https://gerrit.wikimedia.org/r/239114 (owner: 10Faidon Liambotis) [17:57:55] (03CR) 10Cmjohnson: [C: 032] remove mw1031 from dsh groups and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/239513 (https://phabricator.wikimedia.org/T113079) (owner: 10Dzahn) [17:58:18] (03CR) 10Faidon Liambotis: [C: 
032] Geolocate our networks to their respective DC [dns] - 10https://gerrit.wikimedia.org/r/239069 (owner: 10Faidon Liambotis) [17:58:45] !log banning 1030 from eqiad elastic cluster for T112559#1660068 [17:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:02:09] (03CR) 10Faidon Liambotis: [C: 032] Add codfw everywhere on the map [dns] - 10https://gerrit.wikimedia.org/r/239070 (owner: 10Faidon Liambotis) [18:02:19] (03CR) 10Faidon Liambotis: [C: 032] Switch Middle-East's backup from ulsfo to eqiad [dns] - 10https://gerrit.wikimedia.org/r/239071 (owner: 10Faidon Liambotis) [18:02:25] (03PS3) 10Dzahn: admin: let contint-admins run puppet [puppet] - 10https://gerrit.wikimedia.org/r/234539 (https://phabricator.wikimedia.org/T110943) (owner: 10Hashar) [18:02:58] (03PS4) 10Dzahn: admin: let contint-admins run puppet [puppet] - 10https://gerrit.wikimedia.org/r/234539 (https://phabricator.wikimedia.org/T110943) (owner: 10Hashar) [18:04:56] yurik: ping [18:05:21] akosiaris, sorry, another meeting. Could you just in case disable your water labels script for now? [18:05:46] yurik: why ? I 've run it a couple of times and it has created no problems [18:05:50] (03CR) 10RobH: [C: 031] "I'm on ops puppet swat this week with brandon, so this is my +1 sanity review." [puppet] - 10https://gerrit.wikimedia.org/r/230483 (https://phabricator.wikimedia.org/T97195) (owner: 10Smalyshev) [18:06:09] I am pretty sure it was not the source of the problem. 
Plus it never really ran [18:06:12] akosiaris, because last week we generated tons of tiles, only to find out that they all were broken, and we had to postpone launch [18:06:32] again, it never really ran [18:06:35] it might be ok, but i want to have some controls in place before we automate [18:06:40] it cannot be the cause of the problems [18:06:48] something has emptied the water label tables [18:06:51] not sure what it is [18:06:57] water polygons that is [18:07:08] trying to minimize problems here [18:07:10] yeah, I know, I've commented already on the change [18:10:05] (03CR) 10RobH: "I'm on puppet swat this week, so giving each of the swatting bugs an initial review. This appears to be a typical redirection. The only " [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [18:10:38] gwicke: Available for a restbase question? https://gerrit.wikimedia.org/r/#/c/239278/ has the removal of /{domain:et.wikimedia.org}: *wp/default/1.0.0 from the modules/restbase/templates/config.yaml.erb [18:10:55] I just wanted to check if there are followup steps to be done post merge of that? [18:11:03] (and if so, what are they so we don't mess up restbase!) 
[18:11:32] otherwise its a typical apache redirection patch, so i only am asking you about the restbase change specifically =] [18:12:24] i can confirm "ee" has been added to replace "et" and the new one is in the restbase config [18:12:57] afaik it needs restbase deploy but when we added "ee" it just happened to be a minute before a deploy anyways [18:13:23] it's renaming of a wiki [18:13:30] yea i joined services and asked [18:13:40] i just want to ensure removal doesnt break anything [18:13:44] adding things is always easier =] [18:14:01] marko is checking it out [18:14:22] right [18:15:26] yurik: just re-ran it right now, completed once more without problems and the table is still 35907 lines long [18:16:18] akosiaris, could you check with MaxSem about this - he was the one who rebuilt it after failure. We should have some automated script to check data validity, otherwise we will generate junk tiles that will be difficult to fish out and regen [18:16:39] !log disabling puppet on mw1031 [18:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:50] (03PS2) 10Ori.livneh: varnish: Don't disable Cache-Control for all mobile traffic [puppet] - 10https://gerrit.wikimedia.org/r/239826 (https://phabricator.wikimedia.org/T113007) (owner: 10Krinkle) [18:17:05] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1660255 (10Jgreen) Ok, this is switched for all americium collectors. [18:17:18] yurik: a check ? that sounds reasonable. 
I 'll ping MaxSem on what to actually check for [18:17:27] 6operations: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1660256 (10faidon) 3NEW [18:17:59] akosiaris, i'm thinking a) at least N rows for specific table, and b) geo index must be present (we ran into this issue again) [18:18:10] (03PS3) 10Faidon Liambotis: Remove home_pmtpa and svn client from bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/231142 (https://phabricator.wikimedia.org/T113265) [18:18:31] (03CR) 10Ori.livneh: [C: 032] varnish: Don't disable Cache-Control for all mobile traffic [puppet] - 10https://gerrit.wikimedia.org/r/239826 (https://phabricator.wikimedia.org/T113007) (owner: 10Krinkle) [18:19:16] (03PS5) 10Dzahn: admin: let contint-admins run puppet [puppet] - 10https://gerrit.wikimedia.org/r/234539 (https://phabricator.wikimedia.org/T110943) (owner: 10Hashar) [18:20:18] (03CR) 10Dzahn: [C: 032] admin: let contint-admins run puppet [puppet] - 10https://gerrit.wikimedia.org/r/234539 (https://phabricator.wikimedia.org/T110943) (owner: 10Hashar) [18:21:38] (03PS13) 10Andrew Bogott: toolschecker: read/write test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [18:21:40] (03PS6) 10Andrew Bogott: toolschecker: Add tests for starting/stopping web services [puppet] - 10https://gerrit.wikimedia.org/r/239504 [18:22:07] ori: https://gerrit.wikimedia.org/r/#/c/238536/2 ? thinking of pushing that around starting today, with slow rolling restarts to take effect. any last thoughts, etc? [18:22:36] (03CR) 10RobH: [C: 031] "puppet swat review +1. making old rt tickets public results in volunteers fixing our stuff! \o/" [puppet] - 10https://gerrit.wikimedia.org/r/229426 (https://phabricator.wikimedia.org/T84060) (owner: 10JanZerebecki) [18:22:40] bblack: no last thoughts; it would be awesome and much appreciated. 
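The data validity check proposed above (at least N rows in a given table, and the geo index present) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the index name, and the threshold are hypothetical, and the caller is assumed to have already fetched the row count and index names from the database.

```python
def validate_table(row_count, index_names, min_rows, required_index):
    """Return a list of problems; an empty list means the table looks sane.

    All names and thresholds here are hypothetical, for illustration only.
    """
    problems = []
    # Check (a): the table must hold at least `min_rows` rows.
    if row_count < min_rows:
        problems.append(
            "only %d rows, expected at least %d" % (row_count, min_rows)
        )
    # Check (b): the geo index must be present.
    if required_index not in index_names:
        problems.append("missing index %s" % required_index)
    return problems


if __name__ == "__main__":
    # 35907 is the row count quoted in the channel; the index name is made up.
    print(validate_table(35907, {"water_polygons_geom_idx"},
                         30000, "water_polygons_geom_idx"))
    # An emptied table with no index fails both checks.
    print(validate_table(0, set(), 30000, "water_polygons_geom_idx"))
```

Running a check like this before tile generation would stop the job early instead of producing junk tiles that later have to be found and regenerated.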
[18:22:50] (03CR) 10jenkins-bot: [V: 04-1] toolschecker: Add tests for starting/stopping web services [puppet] - 10https://gerrit.wikimedia.org/r/239504 (owner: 10Andrew Bogott) [18:22:53] (03CR) 10jenkins-bot: [V: 04-1] toolschecker: read/write test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) (owner: 10Andrew Bogott) [18:23:02] ok! [18:26:13] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1660294 (10Dzahn) 5stalled>3Resolved applied on gallium [18:27:04] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1660296 (10Dzahn) But it was reverted? What's up @Yuvipanda? [18:27:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1660298 (10Dzahn) a:3yuvipanda [18:28:09] (03PS1) 10Cmjohnson: Removing mw1031 dns entries [dns] - 10https://gerrit.wikimedia.org/r/239895 [18:29:42] (03CR) 10RobH: [C: 031] "looks good, puppetswat review (won't merge until tomorrow's window)" [puppet] - 10https://gerrit.wikimedia.org/r/239367 (https://phabricator.wikimedia.org/T112892) (owner: 10JanZerebecki) [18:30:24] (03PS2) 10Cmjohnson: Removing mw1031 dns entries [dns] - 10https://gerrit.wikimedia.org/r/239895 [18:30:45] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1660326 (10Krenair) Revert has not been merged yet: https://gerrit.wikimedia.org/r/#/c/239892/ Bit of a weird reason to revert someone's access though... [18:31:07] (03CR) 10Alex Monk: "Reverting just because they're too busy to verify? 
:/" [puppet] - 10https://gerrit.wikimedia.org/r/239892 (owner: 10Yuvipanda) [18:31:37] (03PS2) 10Yuvipanda: Revert "admin: Allow aklapper to reset user auths and delete accounts in Phab" [puppet] - 10https://gerrit.wikimedia.org/r/239892 [18:32:15] (03CR) 10Yuvipanda: [C: 032 V: 032] "Don't feel comfortable leaving untested sudo rules up on a prod machine. I'll re-merge whenever andre has time to test :)" [puppet] - 10https://gerrit.wikimedia.org/r/239892 (owner: 10Yuvipanda) [18:33:11] (03CR) 10Cmjohnson: [C: 032] Removing mw1031 dns entries [dns] - 10https://gerrit.wikimedia.org/r/239895 (owner: 10Cmjohnson) [18:34:48] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1660349 (10hashar) Works on labnodepool1001.eqiad.wmnet as well. Thank you! [18:36:12] (03PS14) 10Andrew Bogott: toolschecker: read/write test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [18:36:14] (03PS7) 10Andrew Bogott: toolschecker: Add tests for starting/stopping web services [puppet] - 10https://gerrit.wikimedia.org/r/239504 [18:36:16] (03PS1) 10Andrew Bogott: Reformatted comments to make pep8 happy. [puppet] - 10https://gerrit.wikimedia.org/r/239899 [18:36:52] 6operations, 10ops-eqiad, 5Patch-For-Review: Decommission mw1031 - https://phabricator.wikimedia.org/T113079#1660351 (10Cmjohnson) removed dns, icinga, salt-keys, puppet certs. Requires wipe [18:37:24] (03CR) 10Mobrovac: "Wrt to the RESTBase config part, there is no harm in removing a domain per se - it simply won't be served any more (as in, doing a request" [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [18:37:36] robh: ^^ [18:38:00] awesome, thank you! [18:38:07] (03PS2) 10Andrew Bogott: mm_cfg.py: Reformatted comments to make pep8 happy. 
[puppet] - 10https://gerrit.wikimedia.org/r/239899 [18:38:09] (03PS15) 10Andrew Bogott: toolschecker: read/write test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [18:38:11] (03PS8) 10Andrew Bogott: toolschecker: Add tests for starting/stopping web services [puppet] - 10https://gerrit.wikimedia.org/r/239504 [18:38:24] hrmm [18:38:27] good question in your review [18:38:39] i think we may ineed want them to just split off the restbase change and address it out of scope [18:38:44] indeed [18:38:54] JohnFLewis, mutante, can you live with https://gerrit.wikimedia.org/r/#/c/239899/ ? [18:39:33] andrewbogott: yeah. it's clearly c&p from the defaults file :) [18:39:51] robh: if varnish still takes precedence over the apache config, then i think we should as this is something we haven't thought through really for RB [18:39:53] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1660369 (10RobH) @Krenair: There are questions about removal of the data from restbase on the patchset. Would you want to simply split this into two changesets, and address only the a... [18:40:01] Right now that file is breaking CI tests for every patch in the repo. I don’t understand why Jenkins suddenly started caring [18:40:28] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1660370 (10yuvipanda) I've reverted because: # @chasemp pointed out potential problems in the sudo rule's 'strip' line # Andre is busy with gerrit cleanup day wo... [18:40:33] (03PS3) 10Andrew Bogott: mm_cfg.py: Reformatted comments to make pep8 happy. [puppet] - 10https://gerrit.wikimedia.org/r/239899 [18:40:35] (03CR) 10John F. Lewis: [C: 031] mm_cfg.py: Reformatted comments to make pep8 happy. 
[puppet] - 10https://gerrit.wikimedia.org/r/239899 (owner: 10Andrew Bogott) [18:41:53] (03CR) 10Andrew Bogott: [C: 032] mm_cfg.py: Reformatted comments to make pep8 happy. [puppet] - 10https://gerrit.wikimedia.org/r/239899 (owner: 10Andrew Bogott) [18:44:05] (03CR) 10Alex Monk: "I'm fine with splitting this to let the apache config change proceed without being blocked by issues in RESTBase." [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [18:44:23] (03PS16) 10Andrew Bogott: toolschecker: read/write test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) [18:44:25] (03PS9) 10Andrew Bogott: toolschecker: Add tests for starting/stopping web services [puppet] - 10https://gerrit.wikimedia.org/r/239504 [18:46:54] (03CR) 10Alex Monk: "I realised that we've done this sort of change before, and the redirect is indeed not completely working because of RESTBase: https://be-x" [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [18:47:00] (03CR) 10Ori.livneh: "The only conceivably private thing that the cache could include is the content of variables in /srv/mediawiki/private/PrivateSettings.php," [puppet] - 10https://gerrit.wikimedia.org/r/239795 (owner: 10Giuseppe Lavagetto) [18:47:45] (03PS2) 10Andrew Bogott: Labs: Include python-openstackclient on the controller host. [puppet] - 10https://gerrit.wikimedia.org/r/239570 [18:48:19] MaxSem, back? 
[18:49:02] (03PS1) 10Dzahn: lists: don't load mod status [puppet] - 10https://gerrit.wikimedia.org/r/239903 [18:49:08] (03CR) 10Andrew Bogott: [C: 032] toolschecker: read/write test for labsdb1004 [puppet] - 10https://gerrit.wikimedia.org/r/239183 (https://phabricator.wikimedia.org/T107449) (owner: 10Andrew Bogott) [18:49:50] (03PS2) 10Dzahn: lists: don't load mod status [puppet] - 10https://gerrit.wikimedia.org/r/239903 [18:49:53] (03PS3) 10Andrew Bogott: Labs: Include python-openstackclient on the controller host. [puppet] - 10https://gerrit.wikimedia.org/r/239570 [18:50:39] (03PS3) 10Alex Monk: Redirect et.wikimedia.org to ee.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/239278 (https://phabricator.wikimedia.org/T31919) [18:50:45] yurik, lunch! [18:50:55] MaxSem, ping akosiaris before he takes off [18:50:59] but yeah, I saw your convo [18:51:09] (03PS3) 10Ori.livneh: asset-check: Use mwLoadEvent hook instead of polling modules directly [puppet] - 10https://gerrit.wikimedia.org/r/235956 (owner: 10Krinkle) [18:51:24] (03CR) 10Ori.livneh: [C: 032 V: 032] asset-check: Use mwLoadEvent hook instead of polling modules directly [puppet] - 10https://gerrit.wikimedia.org/r/235956 (owner: 10Krinkle) [18:51:37] (03PS3) 10Dzahn: lists: don't load mod status [puppet] - 10https://gerrit.wikimedia.org/r/239903 [18:52:08] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1660422 (10Krenair) Dealing with the old entry in RESTBase looks like it's going to have to be a separate ticket, I'm closing this when the apache redirect goes live. [18:52:44] (03CR) 10QChris: [C: 031] "ostriches did the manual clean-up on ytterbium." 
[puppet] - 10https://gerrit.wikimedia.org/r/238976 (owner: 10QChris) [18:53:29] (03CR) 10QChris: [C: 031] Make gerrit offer newer key exchange algorithms for new sshs [puppet] - 10https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: 10QChris) [18:54:27] (03CR) 10Dzahn: [C: 032] lists: don't load mod status [puppet] - 10https://gerrit.wikimedia.org/r/239903 (owner: 10Dzahn) [18:56:26] (03CR) 10Ori.livneh: "This leaves the printHelp() function orphaned; it should be removed as well. But IMO just leave this code unperturbed for now, there is no" [debs/pybal] - 10https://gerrit.wikimedia.org/r/239390 (owner: 10Giuseppe Lavagetto) [18:58:35] (03PS4) 10Andrew Bogott: Labs: Include python-openstackclient on the controller host. [puppet] - 10https://gerrit.wikimedia.org/r/239570 [18:58:43] 6operations, 10Wikimedia-Mailing-lists: Fix description encoding of 4 lists - https://phabricator.wikimedia.org/T113272#1660445 (10JohnLewis) 3NEW a:3Dzahn [19:00:24] (03PS1) 10Aaron Schulz: Enable async secondary writes for mysql-multiwrite cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239904 [19:02:44] 6operations, 10Wikimedia-Mailing-lists: Fix description encoding of 4 lists - https://phabricator.wikimedia.org/T113272#1660488 (10JohnLewis) Looked into the cause of this quickly; it is a feature added in 2.1.16 as part of Debian's UTF-8 support. Where ISO encoded strings are no longer being converted automat... 
[19:06:54] (03PS2) 10Aaron Schulz: Enable async secondary writes for mysql-multiwrite cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239904 [19:10:12] (03CR) 10Thcipriani: [C: 032] Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [19:10:16] yuvipanda: I'm still getting integration shinken emails :( [19:10:25] (03CR) 10Ori.livneh: [C: 031] Enable async secondary writes for mysql-multiwrite cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239904 (owner: 10Aaron Schulz) [19:14:41] (03PS3) 10Ori.livneh: Enable async secondary writes for mysql-multiwrite cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239904 (owner: 10Aaron Schulz) [19:14:48] (03CR) 10Ori.livneh: [C: 032] Enable async secondary writes for mysql-multiwrite cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239904 (owner: 10Aaron Schulz) [19:14:53] (03Merged) 10jenkins-bot: Enable async secondary writes for mysql-multiwrite cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239904 (owner: 10Aaron Schulz) [19:15:52] !log ori@tin Synchronized wmf-config/CommonSettings.php: Ieccb23f: Enable async secondary writes for mysql-multiwrite cache (on testwiki) (duration: 00m 13s) [19:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:24] (03PS7) 10Rush: elastic: sane diamond collector for WMF [puppet] - 10https://gerrit.wikimedia.org/r/239511 [19:16:36] (03PS8) 10Rush: elastic: sane diamond collector for WMF [puppet] - 10https://gerrit.wikimedia.org/r/239511 [19:17:11] (03CR) 10Thcipriani: [C: 04-1] "Causes error on regular scap when cherry-picked to beta: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/71019/console" [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [19:17:36] (03Abandoned) 10Chad: Convert tasks.* to use context logger [tools/scap] - 
10https://gerrit.wikimedia.org/r/239112 (owner: 10Chad) [19:19:05] (03Merged) 10jenkins-bot: Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [19:19:51] (03PS9) 10Rush: elastic: sane diamond collector for WMF [puppet] - 10https://gerrit.wikimedia.org/r/239511 [19:21:16] (03CR) 10Rush: [C: 032] elastic: sane diamond collector for WMF [puppet] - 10https://gerrit.wikimedia.org/r/239511 (owner: 10Rush) [19:22:17] (03CR) 10Chad: "We should probably stop passing loggers to it then. I'll write up a patch and cherry pick this on top." [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4) [19:23:15] (03PS1) 10Rush: elastic: apply elasticsearch::monitor::diamond to codfw [puppet] - 10https://gerrit.wikimedia.org/r/239911 [19:25:04] (03CR) 10Rush: [C: 032] elastic: apply elasticsearch::monitor::diamond to codfw [puppet] - 10https://gerrit.wikimedia.org/r/239911 (owner: 10Rush) [19:29:17] 6operations, 10ops-eqiad: label server mendelevium / wmf4543 / update racktables - https://phabricator.wikimedia.org/T113281#1660620 (10RobH) 3NEW a:3Cmjohnson [19:29:29] 6operations, 10ops-eqiad: label server mendelevium / wmf4543 / update racktables - https://phabricator.wikimedia.org/T113281#1660630 (10RobH) [19:29:30] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1660631 (10RobH) [19:29:37] robh: beat me to it...was going to with wmf4541 thought [19:31:35] PROBLEM - salt-minion processes on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:31:51] 6operations: setup / deploy mendelevium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660635 (10RobH) 3NEW [19:32:03] 6operations: setup / deploy mendelevium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660644 (10RobH) [19:32:03] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1627748 (10RobH) [19:32:04] PROBLEM - DPKG on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:04] PROBLEM - Disk space on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:15] PROBLEM - RAID on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:24] 6operations: setup / deploy mendelevium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660635 (10RobH) [19:32:26] 6operations, 10ops-eqiad: label server mendelevium / wmf4543 / update racktables - https://phabricator.wikimedia.org/T113281#1660649 (10RobH) [19:32:27] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1660646 (10RobH) 5Open>3Resolved mendelevium / wmf4543 is allocaed for this task. The related tasks for the setup and update the labels have been s... [19:32:45] PROBLEM - configured eth on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:54] PROBLEM - HTTP on planet1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:03] PROBLEM - dhclient process on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:33:14] PROBLEM - Check size of conntrack table on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:33:15] PROBLEM - puppet last run on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:34:20] (03CR) 10Thcipriani: [C: 032] Support batch size configuration per stage [tools/scap] - 10https://gerrit.wikimedia.org/r/239016 (https://phabricator.wikimedia.org/T112841) (owner: 10Dduvall) [19:34:24] RECOVERY - configured eth on planet1001 is OK: OK - interfaces up [19:34:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds [19:34:30] that's when it is running updates [19:34:33] RECOVERY - HTTP on planet1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 459 bytes in 0.005 second response time [19:34:35] (03Merged) 10jenkins-bot: Support batch size configuration per stage [tools/scap] - 10https://gerrit.wikimedia.org/r/239016 (https://phabricator.wikimedia.org/T112841) (owner: 10Dduvall) [19:34:35] 6operations: setup / deploy mendelevium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660653 (10yuvipanda) This needs to be trusty rather than jessie as well. [19:34:36] (planet1001) [19:34:44] RECOVERY - dhclient process on planet1001 is OK: PROCS OK: 0 processes with command name dhclient [19:34:54] RECOVERY - Check size of conntrack table on planet1001 is OK: OK: nf_conntrack is 0 % full [19:34:55] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures [19:35:07] 6operations: setup / deploy mendelevium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660659 (10yuvipanda) Let's figure out how to RAID the disks. @Ebernhardson? 
[19:35:14] RECOVERY - salt-minion processes on planet1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:35:34] RECOVERY - DPKG on planet1001 is OK: All packages OK [19:35:34] RECOVERY - Disk space on planet1001 is OK: DISK OK [19:35:45] RECOVERY - RAID on planet1001 is OK: OK: no RAID installed [19:36:00] 6operations, 10ops-eqiad: Wipe and disconnect mw1031 - https://phabricator.wikimedia.org/T113283#1660664 (10Cmjohnson) 3NEW a:3Cmjohnson [19:40:11] ebernhardson: should we just do raid10? [19:40:59] !log temporarily stopping codfw restbase cassandra nodes to test quorum auth [19:41:04] robh: so if we do RAID10 on 4*3TB we'll get 6TB usable space, right? [19:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [19:43:45] edsanders: raid10 gives us 6TB. should be enough, I guess? [19:44:09] err [19:44:13] sorry, I meant ebernhardson [19:49:52] yuvipanda: yea i think 10 is the only option [19:50:00] yup [19:50:27] 6operations: setup / deploy mendelevium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660692 (10yuvipanda) Yup, RAID10 gives us 6T usable space which is good enough, so let's do that. [19:50:29] commented [19:50:41] (03PS1) 10Rush: elastic: add diamond monitoring to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/239919 [19:50:45] ebernhardson: is this in labs? 
[19:51:06] chasemp: yes, trying to setup a "public access" elastic instance with all the prod data [19:51:17] (03PS1) 10Yuvipanda: k8s: Make kubelet use the puppet SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/239920 [19:51:32] (03PS2) 10Yuvipanda: k8s: Make kubelet use the puppet SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/239920 [19:51:51] ebernhardson: why not go for best space since it's a secondary copy of secondary data, or I mean if we can sustain some recovery time it's way cheaper to do raid 0 space wise [19:52:16] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1660693 (10Eevans) The dedicated RESTBase user is now deployed, so a quick test just to verify that the previous issues we saw with QUORU... [19:52:30] (03CR) 10Rush: [C: 032] elastic: add diamond monitoring to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/239919 (owner: 10Rush) [19:52:31] chasemp: disks are 3T and the full prod loadout is 2.5T, doing a full mirror felt like cutting it too close [19:52:32] (03PS1) 10Dzahn: lists: fix duplicate definition re: status module [puppet] - 10https://gerrit.wikimedia.org/r/239921 [19:52:55] chasemp: otherwise, i would prefer 4xmirror for best read performance [19:53:06] mobrovac: out of curiosity, does bluetooth work on your laptop? 
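The capacity trade-off discussed above (RAID10 vs. RAID 0 vs. a 4-way mirror on four 3 TB disks) can be worked through with a small sketch. It assumes textbook capacity formulas and ignores filesystem and metadata overhead; the function names are made up for illustration.

```python
def usable_tb(level, disks, disk_tb):
    """Nominal usable capacity in decimal TB for a few RAID levels.

    Textbook formulas only; real arrays lose a little more to metadata.
    """
    if level == "raid0":
        # Striping: all disks contribute capacity, no redundancy.
        return disks * disk_tb
    if level == "raid1":
        # N-way mirror: capacity of a single disk.
        return disk_tb
    if level == "raid10":
        # Striped two-way mirrors: half the raw capacity.
        if disks < 4 or disks % 2:
            raise ValueError("raid10 here assumes an even number of disks >= 4")
        return disks * disk_tb / 2
    raise ValueError("unsupported level: %s" % level)


def tb_to_tib(tb):
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


if __name__ == "__main__":
    for level in ("raid0", "raid1", "raid10"):
        tb = usable_tb(level, 4, 3.0)
        print("%-6s %5.1f TB  (%.2f TiB)" % (level, tb, tb_to_tib(tb)))
```

With four 3 TB disks this gives 12 TB for RAID 0, 3 TB for a 4-way mirror, and 6 TB for RAID10; the 3 TB mirror leaves little headroom over the 2.5T prod data set mentioned in the channel. The decimal-to-binary conversion is one reason a nominal 6 TB array shows up as somewhat less in most tools.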
[19:53:12] (03PS2) 10Dzahn: lists: fix duplicate definition re: status module [puppet] - 10https://gerrit.wikimedia.org/r/239921 [19:53:39] urandom: yup, had a bt mouse working just fine with it [19:53:45] (03PS3) 10Yuvipanda: k8s: Make kubelet use the puppet SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/239920 [19:53:53] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Make kubelet use the puppet SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/239920 (owner: 10Yuvipanda) [19:53:59] i was having some initial troubles with the driver though [19:54:13] mobrovac: k, mine did too in the beginning, and at some point stopped [19:54:26] i found a work-around, so if you start having problems, ping me :) [19:54:27] (03PS3) 10Dzahn: lists: fix duplicate definition re: status module [puppet] - 10https://gerrit.wikimedia.org/r/239921 [19:54:30] heheh [19:54:35] cool thnx urandom [19:54:40] i'm always running pretty bleeding edge stuff [19:54:55] Debian unstable, updated 3 or 4 times a week [19:55:45] (03CR) 10Dzahn: [C: 032] lists: fix duplicate definition re: status module [puppet] - 10https://gerrit.wikimedia.org/r/239921 (owner: 10Dzahn) [19:57:56] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1660709 (10brion) 3NEW [19:58:13] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1660725 (10brion) [19:58:15] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 7HHVM, 5Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1426066 (10brion) [19:58:36] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1660727 (10Krenair) >>! In T112163#1660646, @RobH wrote: > mendelevium / wmf4543 is allocaed for this task. 
The related tasks for the setup and update... [20:00:30] (03PS1) 10Hashar: Adjust .gitreview for the scap3 branch [tools/scap] (scap3) - 10https://gerrit.wikimedia.org/r/239977 [20:00:50] (03CR) 10jenkins-bot: [V: 04-1] Adjust .gitreview for the scap3 branch [tools/scap] (scap3) - 10https://gerrit.wikimedia.org/r/239977 (owner: 10Hashar) [20:01:17] 6operations, 10OTRS, 10vm-requests: EQIAD: 1 VM request for OTRS - https://phabricator.wikimedia.org/T111532#1660748 (10Krenair) See T112163 [20:01:28] jouncebot, are you sleeping? you didn't notify me about parsoid deploy as usual? [20:02:15] jouncebot: next [20:02:15] In 2 hour(s) and 57 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150921T2300) [20:02:18] (03PS1) 10RobH: setting host mendelevium dns entries [dns] - 10https://gerrit.wikimedia.org/r/239979 [20:02:26] robh: wait, that is already OTRS [20:03:56] omw [20:04:01] 6operations: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660754 (10RobH) [20:04:15] 6operations, 10ops-eqiad: label server nobelium / wmf4543 / update racktables - https://phabricator.wikimedia.org/T113281#1660756 (10RobH) [20:04:47] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1660759 (10RobH) it was, i misallocated since it was in gaeneti. now its allocated as system name nobelium for wmf4543 (associated tasks already updated) [20:04:54] will be deploying a new version of parsoid. 
[20:06:57] (03PS2) 10RobH: setting host nobelium dns entries [dns] - 10https://gerrit.wikimedia.org/r/239979 [20:09:16] (03CR) 10RobH: [C: 032] setting host nobelium dns entries [dns] - 10https://gerrit.wikimedia.org/r/239979 (owner: 10RobH) [20:10:55] 6operations: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660791 (10RobH) [20:11:23] yuvipanda: so raid10 4 * 3tb = 6 raw but not 6 usable [20:11:29] it'll be more like 5.4 usable [20:11:30] but yes [20:11:44] yeah that's good enough [20:11:54] cool [20:12:04] oh, trusty... ewww [20:12:07] but glad you noted it [20:12:29] robh: yup. Elasticsearch systems are still trusty [20:12:45] and I don't want to experiment on that regard here :) [20:13:00] ebernhardson|lch: chasemp can you confirm that ElasticSearch is still trusty? [20:13:11] it is [20:13:14] silly you wanting to actually replicate our setup before experimentation! [20:13:36] cool! [20:14:52] (03CR) 10Phuedx: "Ping Legoktm. 
The values of the config change LGTM, hence my +1, but I'm reluctant to +2 given your initial CR around how the config is lo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [20:16:37] I'm going to head in to the office now [20:17:07] (03PS1) 10RobH: setting up install params for nobelium [puppet] - 10https://gerrit.wikimedia.org/r/239985 [20:17:25] (03CR) 10Chad: [C: 032 V: 032] Adjust .gitreview for the scap3 branch [tools/scap] (scap3) - 10https://gerrit.wikimedia.org/r/239977 (owner: 10Hashar) [20:20:29] (03CR) 10Luke081515: [C: 031] Split langlist for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [20:23:31] 6operations, 10Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1660825 (10JohnLewis) 5Open>3Resolved [20:23:32] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1660826 (10JohnLewis) [20:26:14] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail [20:27:20] ^looking [20:29:40] !log deployed parsoid version 9984d221 [20:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:05] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:32:27] (03CR) 10RobH: [C: 032] setting up install params for nobelium [puppet] - 10https://gerrit.wikimedia.org/r/239985 (owner: 10RobH) [20:39:38] !log banning elastic1005 for T112559 [20:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:56] 6operations: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1660870 (10RobH) [20:41:12] 6operations: setup / deploy nobelium for elastic-search testing in labs - 
https://phabricator.wikimedia.org/T113282#1660635 (10RobH) [20:41:15] !log MobileApps deployed sha1 013044e [20:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:33] (03CR) 10Chad: "Puppet will also need an update, see docroot/noc/conf/index.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [20:47:23] 7Blocked-on-Operations, 7Puppet, 6operations, 10Beta-Cluster, and 4 others: Setup a dedicated mediawiki host in Beta Cluster that we can use for security scanning - https://phabricator.wikimedia.org/T72181#1660914 (10greg) [20:51:32] 6operations, 10Wikimedia-Mailing-lists, 6Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1660928 (10Dzahn) pinged #wiktionary [20:52:05] (03CR) 10Alex Monk: "Puppet?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [20:53:05] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [20:55:04] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: Puppet has 1 failures [21:00:55] (03CR) 10Chad: "Derp, it's also here! Ignore me. I haven't had my third coffee yet..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [21:01:11] Krenair: ^ :) [21:01:25] :) [21:08:13] (03PS3) 10Alex Monk: Split langlist for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) [21:12:33] 6operations: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1661076 (10RobH) [21:18:36] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1661091 (10JohnLewis) a:5Selsharbaty-WMF>3RobH [21:18:47] !disabling puppet on krypton for 1h to check apache config [21:21:14] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:22:53] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1661111 (10RobH) a:5RobH>3Selsharbaty-WMF I'm not comfortable making a private list archive public, I think there are legal/ethical/more ramifications there than we've... [21:22:59] JohnFLewis: that task isnt enough info for me to do that [21:23:09] ive detailed and kicked back to samir [21:23:24] ie: there isnt any way in hell im making a private archive public. [21:23:25] robh: you asked for a response and got it so :) [21:23:36] Yep, it wasnt a good enough one for me, heh [21:23:42] 6operations, 10Wikimedia-Mailing-lists: Evaluate lists with large moderation queues - https://phabricator.wikimedia.org/T110438#1661116 (10JohnLewis) Just for an update on progress: ``` johnflewis@fermium:/var/lib/mailman/data$ head -n 30 /home/johnflewis/held_msg_stats_mailman.txt 11617 education- 10914... 
[21:23:45] mutante: ^^ [21:23:59] I think its a horrible idea to do so, and they should simply do what I first suggested with the renames [21:24:13] but, they can overrride me, but im going to make them spell it out and directly override my concern =] [21:24:16] on fermium looks great if we consider what we carried over from sodium :) [21:24:28] robh: okay :) [21:24:59] 6operations: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1661141 (10RobH) [21:25:26] 6operations: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1661150 (10RobH) a:3yuvipanda system installed and awaiting puppet/salt key acceptance. assigning to yuvi for implementation. [21:25:34] yuvipanda: ^ all yours =] [21:25:39] robh: wonderful [21:25:42] thank you very much [21:26:04] JohnFLewis: i wasnt mad at you assigning to me or anyting, i just wanted to be clear and keep ya in loop why i was kcikign back to him [21:26:09] (03PS3) 10BBlack: Increase Varnish's `shm_reclen` from 1024 to 2048 [puppet] - 10https://gerrit.wikimedia.org/r/238536 (https://phabricator.wikimedia.org/T112002) (owner: 10Ori.livneh) [21:26:13] and also echo in here in case you or mutante disagree with me =] [21:27:06] * robh may be overly paranoid about archive data privacy but meh [21:27:23] robh: you're the one handling it so whatever you need to feel confident about doing it, me and mutante aren't so :) [21:27:37] yea but you two are the other two folks who will talk about list stuff with me! [21:27:58] plus if i check with you i can wave the wand of 'i checked with community' see ;] [21:28:39] for a WMF list? 
sounds sane :) [21:29:21] * JohnFLewis renames list to education-maybe-public-maybe-private-needs-approval [21:30:02] 6operations, 10Wikimedia-Mailing-lists: Fix description encoding of 4 lists - https://phabricator.wikimedia.org/T113272#1661165 (10Dzahn) mediawiki-sv: was already a disabled list that said it's no longer active. changed description to "Listan är inte längre aktiv." per Google translate Ruwikiconference-l: fi... [21:30:31] 6operations, 10Wikimedia-Mailing-lists: Fix description encoding of 4 lists - https://phabricator.wikimedia.org/T113272#1661166 (10Dzahn) a:5Dzahn>3JohnLewis [21:32:13] 6operations, 10Wikimedia-Mailing-lists: Fix description encoding of 4 lists - https://phabricator.wikimedia.org/T113272#1661176 (10JohnLewis) 5Open>3Resolved Looks good to me. Don't see this becoming a problem if these lists received emails. [21:36:34] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1661203 (10JUnikowski_WMF) 3NEW [21:37:05] (03PS2) 10Dzahn: Revert "mailman: ferm, allow rsync from sodium for migration" [puppet] - 10https://gerrit.wikimedia.org/r/235976 [21:39:10] (03CR) 10Dzahn: [C: 032] "planned revert - this firewall rule was just needed for the migration" [puppet] - 10https://gerrit.wikimedia.org/r/235976 (owner: 10Dzahn) [21:41:54] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1661249 (10Krenair) No labs account: ```krenair@tools-bastion-01:~$ ldaplist -l passwd junikowski krenair@tools-bastion-01:~$``` You'll want to do that before someone else reserves y... [21:42:26] (03CR) 10Legoktm: "Well, it's not going to cause any warnings AFAIS." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [21:43:34] (03PS4) 10Dzahn: Remove sodium from puppet (spare/decom) [puppet] - 10https://gerrit.wikimedia.org/r/239411 (https://phabricator.wikimedia.org/T110142) (owner: 10John F. Lewis) [21:44:04] (03CR) 10Dzahn: [C: 032] Remove sodium from puppet (spare/decom) [puppet] - 10https://gerrit.wikimedia.org/r/239411 (https://phabricator.wikimedia.org/T110142) (owner: 10John F. Lewis) [21:45:13] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661270 (10egalvezwmf) 3NEW [21:45:44] (03PS2) 10Dzahn: lists: TTL up to 1H [dns] - 10https://gerrit.wikimedia.org/r/239400 (owner: 10John F. Lewis) [21:45:56] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661278 (10egalvezwmf) a:3Reinrosemary [21:46:59] (03CR) 10Dzahn: [C: 032] lists: TTL up to 1H [dns] - 10https://gerrit.wikimedia.org/r/239400 (owner: 10John F. Lewis) [21:47:03] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1661285 (10JUnikowski_WMF) Hi Alex, Thanks a lot. Did that just now (hopefully I got it right). Jonathan [21:48:04] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T110141" [dns] - 10https://gerrit.wikimedia.org/r/239400 (owner: 10John F. Lewis) [21:48:10] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661287 (10Krenair) > ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCsg/AwUjqqnVHfRO9aR3lqo74YdpJJTo825sOMj+Qw3xIAas6nbe/YQMXRz2/DKK0B0/fiAFRnMK1DGKw0FPOpdMvQdYk6r8yrpZ2AN3QzMAQ+XuzjLlgst4td/Jr... 
[21:48:36] 6operations, 10Wikimedia-Mailing-lists: TTL back up to normal 1H - https://phabricator.wikimedia.org/T110141#1661288 (10Dzahn) 5Open>3Resolved https://gerrit.wikimedia.org/r/#/c/239400/ [21:48:45] we need a KrenairBot [21:49:02] That's a labs ssh key! [21:49:16] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1661299 (10Dzahn) removing from puppet today but not shutdown yet [21:49:59] legoktm, I wrote a script to audit for these [21:50:04] mutante: are we scheduling a party for the exact time you shut sodium down? ;) [21:50:05] caught a couple of people doing it as well [21:50:12] I just don't know how to integrate it with jenkins [21:50:34] file a bug about doing that and cc me on it? [21:50:38] yuvipanda: did you say something about lucid in labs? [21:51:31] It was discussed in https://phabricator.wikimedia.org/T108078 legoktm [21:51:48] (03PS4) 10BBlack: Increase Varnish's `shm_reclen` from 1024 to 2048 [puppet] - 10https://gerrit.wikimedia.org/r/238536 (https://phabricator.wikimedia.org/T112002) (owner: 10Ori.livneh) [21:51:49] JohnFLewis: the "celebrate" ticket is already closed :p [21:51:55] (03CR) 10BBlack: [C: 032 V: 032] Increase Varnish's `shm_reclen` from 1024 to 2048 [puppet] - 10https://gerrit.wikimedia.org/r/238536 (https://phabricator.wikimedia.org/T112002) (owner: 10Ori.livneh) [21:52:22] mutante: I said a party for the shutdown, not the removal of it in production :) [21:53:14] mutante: and https://phabricator.wikimedia.org/T113199 [21:53:23] is what I presume yuvipanda was on about which you meant [21:53:39] JohnFLewis: yes, it is. thanks! good [21:53:58] yup [21:54:01] thanks JohnFLewis [21:55:28] and searching, I see 4 instances with lucid as their OS [21:56:54] yuvipanda: I've listed the instances I see on the ticket [21:56:59] thanks! 
[21:57:38] andrewbogott: I guess #1 instance on https://phabricator.wikimedia.org/T113199 is yours [21:58:09] Krenair: ok, working on something... [21:58:39] so, i was about to merge that change [21:58:42] that removes all lucid support [21:58:52] what made me stop for a second was the sshd part [21:59:06] you still might want to sshd to these instances [21:59:16] ssh [21:59:26] Krenair: is there a programatic way to detect whether you're able to connect to labs ldap? or detect whether you're in labs? [22:00:01] beyond checking network connectivity to an ldap server? [22:00:41] mutante: lucid has been broken in labs for ages [22:01:13] hrm [22:02:12] (03PS1) 10Faidon Liambotis: Switch Russia's esams backup from ulsfo to eqiad [dns] - 10https://gerrit.wikimedia.org/r/239996 [22:02:16] bblack: ^ [22:03:38] (03PS9) 10Dzahn: Remove support for Ubuntu Lucid/10.04 [puppet] - 10https://gerrit.wikimedia.org/r/179888 (owner: 10Faidon Liambotis) [22:03:59] mutante: yes, lucid has been broken in labs for *ages* [22:04:06] labs ssh has never worked there for a while now [22:04:10] feel free to kill [22:04:52] (03CR) 10Dzahn: [C: 032] Remove support for Ubuntu Lucid/10.04 [puppet] - 10https://gerrit.wikimedia.org/r/179888 (owner: 10Faidon Liambotis) [22:04:57] yay [22:05:20] ! ok :) [22:05:31] legoktm, I guess you could check whether the INSTANCEPROJECT or INSTANCENAME environment variables are set. Those sound like a pretty labs-specific things [22:05:41] mutante: did you wipe sodium yet? 
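The environment-variable check suggested above could look roughly like the following. It is a heuristic sketch, not a tested Labs detection method: it assumes, as the suggestion does, that `INSTANCEPROJECT` or `INSTANCENAME` is set on Labs instances, and the `looks_like_labs` helper name is made up for illustration.

```python
import os

# Assumption: these variables are set on Labs instances, per the suggestion above.
LABS_ENV_VARS = ("INSTANCEPROJECT", "INSTANCENAME")

def looks_like_labs(environ=None) -> bool:
    """Heuristic: True if any Labs-specific variable is set and non-empty."""
    env = os.environ if environ is None else environ
    return any(env.get(name) for name in LABS_ENV_VARS)

if __name__ == "__main__":
    print("labs" if looks_like_labs() else "not labs (or variables unset)")
```

Passing the environment in as a parameter keeps the check testable outside Labs. It says nothing about whether the LDAP server is actually reachable, which would still need a connectivity check.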
[22:05:50] mutante: if not, let's take a final backup of it, just in case [22:06:05] (03CR) 10BBlack: [C: 031] Switch Russia's esams backup from ulsfo to eqiad [dns] - 10https://gerrit.wikimedia.org/r/239996 (owner: 10Faidon Liambotis) [22:06:09] (03PS1) 10Ori.livneh: Update mod_status configuration [puppet] - 10https://gerrit.wikimedia.org/r/239998 [22:06:24] mutante: ^ [22:06:25] paravoid: not shutdown or wiped yet, yes, first just puppet [22:06:45] (03CR) 10Faidon Liambotis: [C: 032] Switch Russia's esams backup from ulsfo to eqiad [dns] - 10https://gerrit.wikimedia.org/r/239996 (owner: 10Faidon Liambotis) [22:09:20] 6operations, 10Analytics-EventLogging, 10MediaWiki-extensions-NavigationTiming, 6Performance-Team: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002#1661356 (10BBlack) Note the change itself is merged and deployed to config files on the caches, but it will be a while before... [22:11:56] (03PS1) 10Faidon Liambotis: Fix esams IPv6 space's geolocation [dns] - 10https://gerrit.wikimedia.org/r/239999 [22:12:10] (03CR) 10Faidon Liambotis: [C: 032] Fix esams IPv6 space's geolocation [dns] - 10https://gerrit.wikimedia.org/r/239999 (owner: 10Faidon Liambotis) [22:13:40] (03PS4) 10Dzahn: Replace Package['git-core'] with Package['git'] [puppet] - 10https://gerrit.wikimedia.org/r/233853 (owner: 10Faidon Liambotis) [22:14:59] (03CR) 10Dzahn: [C: 032] Replace Package['git-core'] with Package['git'] [puppet] - 10https://gerrit.wikimedia.org/r/233853 (owner: 10Faidon Liambotis) [22:16:19] Krenair: it looks like tools-login doesn't have python-ldap installed :/ [22:16:41] or maybe my venv doesn't? 
[22:18:16] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661367 (10kevinator) [22:18:17] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1661368 (10kevinator) [22:18:20] (03PS2) 10Dzahn: Ensure gerrit's plugins are kept in sync with plugin repo [puppet] - 10https://gerrit.wikimedia.org/r/238976 (owner: 10QChris) [22:18:40] (03PS1) 10Legoktm: Add tests to make sure production ssh keys are not in labs [puppet] - 10https://gerrit.wikimedia.org/r/240000 [22:19:39] (03CR) 10Dzahn: [C: 032] Ensure gerrit's plugins are kept in sync with plugin repo [puppet] - 10https://gerrit.wikimedia.org/r/238976 (owner: 10QChris) [22:19:54] (03CR) 10jenkins-bot: [V: 04-1] Add tests to make sure production ssh keys are not in labs [puppet] - 10https://gerrit.wikimedia.org/r/240000 (owner: 10Legoktm) [22:20:07] (03PS2) 10Legoktm: [WIP] Add tests to make sure production ssh keys are not in labs [puppet] - 10https://gerrit.wikimedia.org/r/240000 [22:21:04] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add tests to make sure production ssh keys are not in labs [puppet] - 10https://gerrit.wikimedia.org/r/240000 (owner: 10Legoktm) [22:21:43] (03PS4) 10Dzahn: Make gerrit offer newer key exchange algorithms for new sshs [puppet] - 10https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: 10QChris) [22:23:21] (03CR) 10Dzahn: [C: 032] Make gerrit offer newer key exchange algorithms for new sshs [puppet] - 10https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) (owner: 10QChris) [22:23:49] watches gerrit itself because of the change above [22:24:35] It should restart on its own, shouldn't it? 
[22:25:06] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661397 (10Paladox) When will we be able to start using gerrit normaly without needing to do a workaround. So that I can test if it works. [22:25:19] qchris: i saw it create the file but that's it [22:25:55] Harr. No notify. You're right. [22:26:08] i'll restart gerrit [22:26:09] So I guess it needs a manual restart. [22:26:11] Thanks. [22:26:52] !log restarting gerrit for ssh config change [22:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:27:27] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661402 (10Paladox) Seems gerrit has gone down. [22:27:41] 6operations, 10RESTBase, 6Services: RESTBase and domain renames - https://phabricator.wikimedia.org/T113307#1661403 (10mobrovac) 3NEW [22:27:51] Wohoo. It worked. [22:27:59] OpenSSH 7 can connect without the workaround. [22:28:06] :)) [22:28:06] Thanks mutante [22:28:10] thank you! [22:28:22] tells Paladox right away :p [22:28:30] Hahaha :-) [22:28:46] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661412 (10Dzahn) @Paladox it was a restart to fix this issue. try it now :) [22:29:04] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661413 (10Paladox) Ok I will. [22:29:27] About the missing notify ... since it's a jar that's not supposed to change. Should I add it, or is it a wasted effort anyway? [22:29:52] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1661415 (10mobrovac) >>! 
In T31919#1660422, @Krenair wrote: > Dealing with the old entry in RESTBase looks like it's going to have to be a separate ticket, I'm closing this when the apa... [22:30:09] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661417 (10Paladox) it says this now git clone ssh://paladox@gerrit.wikimedia.org:29418/mediawiki/extensions/MsUpload fatal: could not create work tree d... [22:30:16] hmm, i think not needed then and manually restarting is ok [22:31:04] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661420 (10Paladox) Sorry it works now. [22:31:08] ok. [22:31:11] krrrit-wm: you're down are you not [22:31:16] probably the ssh connection is stuck [22:31:59] i'll fix that labs issue now [22:32:19] mutante: https://gerrit.wikimedia.org/r/#/c/240002/ [22:32:23] ah [22:32:34] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 2 failures [22:32:41] yes, exactly what i was uploading too [22:32:43] merging yours [22:33:39] mutante: thanks [22:35:34] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661433 (10Dzahn) >>! In T112025#1661420, @Paladox wrote: > Sorry it works now. yay :) thanks @qchris [22:35:41] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661434 (10Dzahn) 5Open>3Resolved a:3Dzahn [22:36:03] i hope it didnt break the bot at the same time :) [22:36:15] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [22:37:07] qchris: thanks muchly! [22:37:31] Haha. yw. 
[22:39:22] 6operations, 10Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1661444 (10greg) a:5Dzahn>3QChris [22:40:16] mutante: yeah, everytime we restart gerrit the bot gets stuck [22:40:19] I just restarted bot [22:40:22] brb [22:40:34] mutante: It shouldn't have broken the bot. [22:40:36] yuvipanda: i remember it now, yes [22:40:43] Ah. /me's too late. [22:41:35] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1661446 (10mobrovac) The relevant prod patch has been merged as well. Should we close this then? [22:41:48] 6operations, 10RESTBase, 10RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1661447 (10mobrovac) [22:42:33] yuvipanda: re: labs puppet fail, i confirmed it's ok again on tools-exec-1206 [22:42:37] getting train now [22:43:40] stuff down? [22:43:55] no? [22:44:11] hoo: which stuff [22:44:20] just seen a 503 on edit page, OK after reload, mutante [22:44:30] Ok, seems good again [22:44:32] Request: GET http://en.wikisource.org/wiki/Special:Contributions/ShakespeareFan00, from 10.20.0.165 via cp1055 cp1055 ([10.64.32.107]:3128), Varnish XID 2604574317 [22:44:34] Forwarded for: 80.176.129.180, 10.20.0.165, 10.20.0.165, 10.20.0.165 [22:44:35] Error: 503, Service Unavailable at Mon, 21 Sep 2015 22:43:33 GMT [22:44:44] Was just surfing the wiki logged in and got 503s [22:44:45] Assuming temporary [22:45:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [22:45:16] Yeah I just got an error page too [22:45:20] here we go ^^^ [22:45:21] seems OK now [22:45:58] Must be because my kids forced me to listen to Katy Perry. 
Knowledge just hates that [22:46:21] 500 rate looks ugly [22:46:36] https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 mh [22:46:58] negative memory? [22:47:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [22:47:29] mh... I see that all over ganglia... network? [22:48:18] see what all over ganglia? [22:48:31] odd stats, like ganglia lost data for a lot of nodes there for a while [22:48:40] Yeah, it lost data for all hosts for a bit [22:48:48] Something unexpectedly restarted [22:48:49] ? [22:48:54] there was a sizable 500 spike, but now it's been completely dwarfed by a massive 503 spike in reqerr [22:48:56] It happens [22:49:15] (not that I know anything about BIG networks) [22:49:31] https://ganglia.wikimedia.org/latest/?c=Swift%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 - hmm [22:51:01] at first glance it looks like a big broad network dropout within eqiad circa 22:33 [22:51:21] and then maybe some related after-effects in time [22:52:09] well something else happened earlier too, maybe 22:17 ish [22:52:54] I'm looking at logs etc, I don't see anything unusual [22:53:45] well lvs1009 flapping like crazy [22:53:49] but that's "expected" right? [22:54:03] seems to have been happening for hours [22:54:11] lvs1009 shouldn't be up at all. 
I guess network port could be flapping with failed PXE attempts [22:54:16] yeah [22:54:20] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661467 (10egalvezwmf) [22:54:23] also labnet1001 eth5 [22:54:32] but anyway, unrelated flapping ports wouldn't explain this [22:54:35] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661270 (10egalvezwmf) [22:54:59] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Chedasaurus - https://phabricator.wikimedia.org/T113302#1661270 (10egalvezwmf) Thanks @Krenair, I've updated to a new one. [22:55:41] the 22:17 event even shows up in e.g. maps-cluster network traffic report: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=maps+Cluster+codfw&m=cpu_report&s=by+name&mc=2&g=network_report [22:56:07] (which is interesting because we know that service doesn't have a bunch of dependencies on other things like the API, appserver, memcache, etc, etc...) [22:56:28] and it's in codfw [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150921T2300). Please do the needful. [23:00:05] aude: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:10] hi [23:00:48] bblack: Well, it could very well have "just" affected the server that collects the data for ganglia [23:00:58] * aude here :) [23:01:10] yeah except we have a huge 503 spike right after in the reqerr graphs, which is not from ganglia [23:01:15] and user reports here of issues at the same time [23:01:20] Why did jouncebot not mention csteipp? [23:01:32] I'm here! [23:01:36] bblack: Yeah... 
I mean for thins like codfw (I was the first to report the outage, so I noticed it) [23:01:41] Oh, because he didn't use the proper ircnick template and just wrote it manually [23:01:44] yeah maybe [23:01:47] bblack: https://gdash.wikimedia.org/dashboards/reqsum/ [23:01:47] We should really fix that example at some point [23:01:48] greg-g, ^ [23:01:52] grr... sorry about that [23:01:55] although you'd think that data would just buffer somewhere [23:01:59] pageviews tripled [23:02:11] or... ntp? :) [23:02:14] (03PS2) 10Alex Monk: Enable captchas on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238357 (https://phabricator.wikimedia.org/T86460) (owner: 10CSteipp) [23:02:25] paravoid: nothing happening that we should delay/postpone SWAT right? [23:02:38] * greg-g isn't paying that close attenion since you and ori etc were in here [23:02:44] Is the spike related to the issue a few weeks ago? [23:03:00] Where there was a "massive" spike that took stuff offline for a bit? [23:03:22] ShakespeareFan00: I'm not sure what you're referring to exactly [23:03:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:03:49] greg-g: the even seems to have been very brief and not currently ongoing, but then again we have no idea wtf happened yet... [23:03:56] s/even/event/ [23:04:03] The other people here that do operations stuff will know... can't reveal much in public [23:04:29] bblack: so... I won't stop swat for now :) [23:04:43] we good to go then? 
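The "HTTP 5xx req/min" alerts in this log report the share of datapoints above a threshold (250.0 in the OK message, 500.0 in the CRITICAL ones). A rough sketch of that kind of check follows. It is not the actual monitoring plugin: the function names, the 1% tolerance (inferred from "Less than 1.00%"), the WARNING wording and the sample window are all assumptions made for illustration.

```python
# Sketch of a percent-over-threshold check; not the real monitoring plugin.

def percent_above(values, threshold):
    """Percentage of non-null datapoints strictly above threshold."""
    points = [v for v in values if v is not None]
    if not points:
        return 0.0
    return 100.0 * sum(v > threshold for v in points) / len(points)

def classify(values, warning=250.0, critical=500.0, tolerance=1.0):
    """Return a status line modelled on the alert messages in the log."""
    crit = percent_above(values, critical)
    if crit >= tolerance:
        return f"CRITICAL: {crit:.2f}% of data above the critical threshold [{critical}]"
    warn = percent_above(values, warning)
    if warn >= tolerance:
        # Hypothetical wording; no WARNING message appears in this log.
        return f"WARNING: {warn:.2f}% of data above the warning threshold [{warning}]"
    return f"OK: Less than {tolerance:.2f}% above the threshold [{warning}]"

# Hypothetical window: 14 samples, 3 of them above 500 req/min.
sample = [620.0, 710.0, 540.0] + [40.0] * 11
print(classify(sample))
```

As an arithmetic aside, 3 of 14 samples is 21.43% and 4 of 14 is 28.57%, which would match the two CRITICAL figures reported here if the check looked at a 14-point window. That window size is a guess.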
[23:05:40] Krenair: yeah [23:05:59] (03CR) 10Alex Monk: [C: 032] Enable captchas on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238357 (https://phabricator.wikimedia.org/T86460) (owner: 10CSteipp) [23:06:05] (03Merged) 10jenkins-bot: Enable captchas on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238357 (https://phabricator.wikimedia.org/T86460) (owner: 10CSteipp) [23:07:30] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/238357/ (duration: 00m 13s) [23:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:36] csteipp, ^ please test [23:08:00] Krenair: Showing up, thanks! [23:16:15] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0] [23:18:14] aude, err... I just realised that 1.26wmf23 is the only branch in use [23:18:19] (03PS1) 10BBlack: block POSTs to pretty wiki URLs... [puppet] - 10https://gerrit.wikimedia.org/r/240009 [23:18:26] what? [23:18:50] wmf22 is our extension and should be used in wmf23 core [23:18:57] (03PS2) 10BBlack: block POSTs to pretty wiki URLs... [puppet] - 10https://gerrit.wikimedia.org/r/240009 [23:19:04] oh right, because wikidata is a special extension right? [23:19:28] yeah [23:21:18] !log krenair@tin Synchronized php-1.26wmf22/extensions/Wikidata: https://gerrit.wikimedia.org/r/#/c/239828/ (duration: 00m 21s) [23:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:26] aude ^ [23:21:29] * aude checks [23:22:27] looks ok (e.g. nothing broken) [23:22:42] good... did it fix the issue? [23:22:58] the errors happen not that often, but will see that they are gone [23:23:08] (and will produce some debug logging for us) [23:23:46] thanks :) [23:24:01] np [23:24:17] probably abuse filter was misbehaving in some places [23:24:41] (03CR) 10BBlack: [C: 032] block POSTs to pretty wiki URLs... 
[puppet] - https://gerrit.wikimedia.org/r/240009 (owner: BBlack) [23:27:36] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [23:29:30] !log krinkle@tin Synchronized php-1.26wmf23/extensions/NavigationTiming/modules/ext.navigationTiming.js: T112593 (duration: 00m 14s) [23:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:34:04] operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1661587 (Dzahn) a:Dzahn [23:35:15] (PS1) Madhuvishy: Enable logging via stashbot in irc channel wikimedia-analytics [puppet] - https://gerrit.wikimedia.org/r/240014 (https://phabricator.wikimedia.org/T111393) [23:36:11] (CR) Dzahn: "yes, what Alex Monk said. that's why i didn't use the role" [puppet] - https://gerrit.wikimedia.org/r/239023 (owner: Dzahn) [23:37:02] (PS2) Madhuvishy: logstash: Enable logging via stashbot in irc channel wikimedia-analytics [puppet] - https://gerrit.wikimedia.org/r/240014 (https://phabricator.wikimedia.org/T111393) [23:37:12] (PS2) Dzahn: mailman: exim alias for discovery list renames [puppet] - https://gerrit.wikimedia.org/r/238652 (https://phabricator.wikimedia.org/T110256) [23:40:18] (CR) Alex Monk: "Maybe regex.yaml can be used?" [puppet] - https://gerrit.wikimedia.org/r/239023 (owner: Dzahn) [23:41:35] operations, Analytics-EventLogging, Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1661622 (Krinkle) [23:42:37] paravoid: bblack: Any idea why statsv would have stopped at that time? (^) Looks like SAL shows an entry from you around that time.
with regards to anti-spam measures and varnish reconfig [23:44:01] Krinkle: the anti-spam stuff was mailman, unrelated [23:44:43] operations, Analytics-EventLogging, Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1661644 (Krinkle) [23:44:49] paravoid: k [23:45:01] Hey people, i'm trying to add logging to SAL for the analytics irc channel. I see that bd808 is on vacation. Can someone else review my puppet patch for this change? [23:45:15] (PS1) Dzahn: bastion: separate role for general and opsonly [puppet] - https://gerrit.wikimedia.org/r/240016 [23:45:30] You want another channel to be able to log directly to production SAL, madhuvishy? [23:46:12] Krenair: I thought it could be a separate project - so it would show up here: https://tools.wmflabs.org/sal/analytics [23:46:34] oh, I don't know how the sal tool works, just the traditional SAL wiki page [23:46:36] Krenair: patch is here -https://gerrit.wikimedia.org/r/#/c/240014/ [23:47:16] (CR) Dzahn: "like this instead? https://gerrit.wikimedia.org/r/#/c/240016/ aren't they different roles if we want to treat them differently but also u" [puppet] - https://gerrit.wikimedia.org/r/239023 (owner: Dzahn) [23:47:38] madhuvishy, I wonder if this would be eligible for puppet swat [23:47:40] This one looks bad: I can't save any changes to my preferences: https://phabricator.wikimedia.org/T113319 (Also not sure which #projects to attach) [23:47:43] yuvipanda|maybe, ?
[23:48:15] quiddity, #operations #traffic [23:48:17] and CC bblack [23:48:22] k, ty [23:48:27] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [23:48:41] operations, MediaWiki-User-preferences, Traffic: Saving preferences gives 403 error - https://phabricator.wikimedia.org/T113319#1661684 (Quiddity) [23:48:42] Krenair: no it's not urgent and can wait if only bd808 can review, just wondering if someone else can do it so i can add them [23:48:48] I can guess which commit caused this [23:48:55] yeah :( [23:49:21] and of course it's localized dammit [23:49:31] madhuvishy, bd808 won't be able to approve this commit, but I assume ops would value his +1 [23:51:02] operations, Analytics-EventLogging, Graphite: Statsv down since 2015-09-20 07:53 - https://phabricator.wikimedia.org/T113315#1661697 (Krinkle) It seems hafnium has been having a hard time since around that time: ... [23:51:25] operations, Traffic: Saving preferences or blocking (and probably various other things) give 403 error - https://phabricator.wikimedia.org/T113319#1661698 (Krenair) [23:52:01] (PS1) Faidon Liambotis: Revert "block POSTs to pretty wiki URLs..." [puppet] - https://gerrit.wikimedia.org/r/240018 (https://phabricator.wikimedia.org/T113319) [23:52:04] operations, Traffic, Patch-For-Review: Saving preferences or blocking (and probably various other things) give 403 error - https://phabricator.wikimedia.org/T113319#1661661 (Krenair) Sounds like this may have been caused by https://gerrit.wikimedia.org/r/#/c/240009/ [23:52:32] (CR) Faidon Liambotis: [C: 2 V: 2] Revert "block POSTs to pretty wiki URLs..."
[puppet] - https://gerrit.wikimedia.org/r/240018 (https://phabricator.wikimedia.org/T113319) (owner: Faidon Liambotis) [23:52:55] Krenair: ummm sorry I don't understand [23:53:15] (PS2) Dzahn: bastion: separate role for general and opsonly [puppet] - https://gerrit.wikimedia.org/r/240016 [23:53:34] madhuvishy: hey -- we're in the middle of a crisis, so excuse the lack of responses [23:53:45] paravoid: no problem, this can wait