[00:04:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:05:44] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:07:25] !log running mwscript refreshLinks.php --wiki=metawiki --namespace=2 on terbium (T145366) [00:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:36] T145366: Create and populate babel database table on Wikimedia wikis - https://phabricator.wikimedia.org/T145366 [00:08:08] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3423391 (10Dzahn) [00:09:29] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3395803 (10Dzahn) analytics1047 already seemed ok, showed the right IP, racreset anyways but it stayed the same analytics1061 also showed the right IP but the wrong gateway.... 
[00:15:58] (03PS1) 10Smalyshev: Fix nginx parametrization - use variable consistently for port [puppet] - 10https://gerrit.wikimedia.org/r/364349 [00:19:53] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [00:21:24] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3423420 (10faidon) [00:21:27] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3423421 (10Dzahn) [00:22:10] arg, i think i reverted faidon [00:22:16] with an edit in the same moment [00:22:48] wants "undo" action in phab [00:24:30] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3423424 (10faidon) I racreset all of the ones in list which had a discrepancy of their IP configuration with the output (showing 192.168.0.1 as gateway) and they're all fixed... 
[00:24:54] (03CR) 10Brian Wolff: [C: 04-1] Add ar_content_format and ar_content_model to labs views (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: 10Umherirrender) [00:27:00] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3423428 (10faidon) [00:28:03] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3423429 (10Dzahn) [00:30:08] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3423435 (10faidon) [00:37:01] (03CR) 10Catrope: [C: 032] Enable experimental RCFilters live update feature in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364266 (https://phabricator.wikimedia.org/T167743) (owner: 10Catrope) [00:39:23] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3423459 (10faidon) [00:40:37] (03Merged) 10jenkins-bot: Enable experimental RCFilters live update feature in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364266 (https://phabricator.wikimedia.org/T167743) (owner: 10Catrope) [00:41:09] (03CR) 10jenkins-bot: Enable experimental RCFilters live update feature in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364266 (https://phabricator.wikimedia.org/T167743) (owner: 10Catrope) [00:41:10] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: remove icinga monitoring for benefactorevents.wm.o SSL certificate - https://phabricator.wikimedia.org/T170139#3423466 (10Dzahn) Notice: /Stage[main]/Icinga/Nagios_host[benefactorevents.wikimedia.org]/ensure: removed Info: Computing checksum on file... 
[00:41:20] (03PS1) 10BryanDavis: striker: Set utf-8 for python runtime [puppet] - 10https://gerrit.wikimedia.org/r/364350 (https://phabricator.wikimedia.org/T164034) [00:44:42] (03PS1) 10Papaul: DHCP: Add MAC address for netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/364353 [00:45:06] (03CR) 10jerkins-bot: [V: 04-1] striker: Set utf-8 for python runtime [puppet] - 10https://gerrit.wikimedia.org/r/364350 (https://phabricator.wikimedia.org/T164034) (owner: 10BryanDavis) [00:53:14] (03PS1) 10Dzahn: disable base monitoring for labtest* machines [puppet] - 10https://gerrit.wikimedia.org/r/364355 [00:58:54] 10Operations, 10Traffic, 10HTTPS, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3423500 (10BBlack) [00:58:56] 10Operations, 10Traffic: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3423497 (10BBlack) 05Open>03Resolved a:03BBlack This hole was removed today in https://gerrit.wikimedia.org/r/#/c/364252 , so this is resolved assuming w... 
[01:00:52] 10Operations, 10Traffic: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3423524 (10BBlack) [01:09:50] (03PS2) 10BryanDavis: striker: Set utf-8 for python runtime [puppet] - 10https://gerrit.wikimedia.org/r/364350 (https://phabricator.wikimedia.org/T164034) [01:10:57] (03CR) 10jerkins-bot: [V: 04-1] striker: Set utf-8 for python runtime [puppet] - 10https://gerrit.wikimedia.org/r/364350 (https://phabricator.wikimedia.org/T164034) (owner: 10BryanDavis) [01:15:52] (03PS2) 10BryanDavis: striker: Override http-socket config [puppet] - 10https://gerrit.wikimedia.org/r/362210 (https://phabricator.wikimedia.org/T169070) (owner: 10Alexandros Kosiaris) [01:15:54] (03PS3) 10BryanDavis: striker: Set utf-8 for python runtime [puppet] - 10https://gerrit.wikimedia.org/r/364350 (https://phabricator.wikimedia.org/T164034) [01:16:38] (03CR) 10BryanDavis: [C: 031] "Tested in striker labs project." [puppet] - 10https://gerrit.wikimedia.org/r/362210 (https://phabricator.wikimedia.org/T169070) (owner: 10Alexandros Kosiaris) [01:17:44] (03CR) 10BryanDavis: [C: 031] striker: Override http-socket config [puppet] - 10https://gerrit.wikimedia.org/r/362210 (https://phabricator.wikimedia.org/T169070) (owner: 10Alexandros Kosiaris) [01:19:30] (03CR) 10BryanDavis: [C: 031] "Gerrit won't let me v+1, but I tested this in the Striker labs project and it fixes the stdout/stderr encoding problem." 
[puppet] - 10https://gerrit.wikimedia.org/r/364350 (https://phabricator.wikimedia.org/T164034) (owner: 10BryanDavis) [01:19:34] (03CR) 10BryanDavis: [C: 031] Remove ferm service for striker / 8081 [puppet] - 10https://gerrit.wikimedia.org/r/364174 (https://phabricator.wikimedia.org/T169070) (owner: 10Muehlenhoff) [01:21:07] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address for netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/364353 (owner: 10Papaul) [01:23:32] (03CR) 10Dzahn: [C: 031] Remove ferm service for striker / 8081 [puppet] - 10https://gerrit.wikimedia.org/r/364174 (https://phabricator.wikimedia.org/T169070) (owner: 10Muehlenhoff) [01:24:03] (03PS2) 10Dzahn: Remove ferm service for striker / 8081 [puppet] - 10https://gerrit.wikimedia.org/r/364174 (https://phabricator.wikimedia.org/T169070) (owner: 10Muehlenhoff) [01:24:48] (03CR) 10Dzahn: [C: 032] Remove ferm service for striker / 8081 [puppet] - 10https://gerrit.wikimedia.org/r/364174 (https://phabricator.wikimedia.org/T169070) (owner: 10Muehlenhoff) [01:26:23] (03PS3) 10Dzahn: striker: Override http-socket config [puppet] - 10https://gerrit.wikimedia.org/r/362210 (https://phabricator.wikimedia.org/T169070) (owner: 10Alexandros Kosiaris) [01:28:17] (03CR) 10Dzahn: [C: 032] striker: Override http-socket config [puppet] - 10https://gerrit.wikimedia.org/r/362210 (https://phabricator.wikimedia.org/T169070) (owner: 10Alexandros Kosiaris) [01:28:51] (03PS4) 10Dzahn: striker: Set utf-8 for python runtime [puppet] - 10https://gerrit.wikimedia.org/r/364350 (https://phabricator.wikimedia.org/T164034) (owner: 10BryanDavis) [01:30:10] (03CR) 10Dzahn: [C: 032] striker: Set utf-8 for python runtime [puppet] - 10https://gerrit.wikimedia.org/r/364350 (https://phabricator.wikimedia.org/T164034) (owner: 10BryanDavis) [01:40:03] PROBLEM - striker on californium is CRITICAL: connect to address 208.80.154.147 and port 8081: Connection refused [01:51:22] bd808: ^ see merges, and icinga-wm , heh [01:51:33] but that just shows 
it was applied :) [02:11:38] mutante: heh. I didn't know that we had a watch on it [02:12:56] (03CR) 10Dzahn: "it just had a side-effect, alert in Icinga https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=californium&service=striker" [puppet] - 10https://gerrit.wikimedia.org/r/362210 (https://phabricator.wikimedia.org/T169070) (owner: 10Alexandros Kosiaris) [02:13:12] (03CR) 10Dzahn: "it just had a side-effect, alert in Icinga https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=californium&service=striker" [puppet] - 10https://gerrit.wikimedia.org/r/364174 (https://phabricator.wikimedia.org/T169070) (owner: 10Muehlenhoff) [02:13:23] bd808: me neither, i just found it in service::uwsgi [02:13:31] if $has_spec { [02:13:38] # Advanced monitoring [02:13:49] } else { .. # Basic monitoring [02:13:51] that should probably move to the role/profile too [02:14:01] and Basic = check_command => "check_http_port_url!${port}!${healthcheck_url}" [02:14:49] that only works from external, not an NRPE check running on the monitored host itself [02:15:45] ACKNOWLEDGEMENT - striker on californium is CRITICAL: connect to address 208.80.154.147 and port 8081: Connection refused daniel_zahn https://gerrit.wikimedia.org/r/#/c/364174/ [02:16:49] just ACKed it for now, we can find a fix tomorrow, need to get dinner :) [02:17:14] and yea @ move to role/profile [02:17:25] sounds good [02:25:39] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) (duration: 08m 46s) [02:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jul 11 02:32:17 UTC 2017 (duration 6m 39s) [02:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:24] 10Operations, 10Performance-Team, 10User-Elukey, 10Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - 
https://phabricator.wikimedia.org/T125735#3423614 (10Krinkle) [04:09:36] 10Operations, 10Traffic, 10Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#3423643 (10Krinkle) >>! In T124954#3421257, @BBlack wrote: > [..] We don't believe it should be possible at this time for an object to exist in the caching layers for more than 4 days... [04:09:45] 10Operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#3423644 (10Krinkle) [04:10:43] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=479.62 Read Requests/Sec=2330.90 Write Requests/Sec=0.30 KBytes Read/Sec=38771.60 KBytes_Written/Sec=7.20 [04:18:53] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=43.10 Read Requests/Sec=0.00 Write Requests/Sec=0.20 KBytes Read/Sec=0.00 KBytes_Written/Sec=2.40 [04:23:40] 10Operations, 10Deployment-Systems, 10Performance-Team, 10HHVM, and 2 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3423647 (10Krinkle) >>! In T103886#3419367, @Joe wrote: > @Krinkle sure, we can enable reusing TC in beta for no... [04:23:50] (03CR) 10Krinkle: [C: 031] deployment-prep: enable reusable TC on HHVM [puppet] - 10https://gerrit.wikimedia.org/r/364148 (https://phabricator.wikimedia.org/T103886) (owner: 10Giuseppe Lavagetto) [05:01:10] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3423679 (10Marostegui) Nice catch faidon!! Thanks for fixing this and specially thanks for fixing dbstore1001, which is a critical host for us! 
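Note on the 02:13 to 02:14 exchange above: the basic branch of service::uwsgi sets check_command => "check_http_port_url!${port}!${healthcheck_url}", and, as said in the channel, that only works as an external probe. Once the ferm service for port 8081 was removed, the probe against californium got "Connection refused", which is the alert that was ACKed. The following is a minimal Python sketch of that kind of external probe. The function name is borrowed from the check_command string, while the timeout, return values and message wording are assumptions for illustration; this is not the actual Icinga plugin.

```python
import urllib.error
import urllib.request

# Opener that ignores proxy environment variables, so the probe connects
# straight to host:port the way an external monitoring host would.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_http_port_url(host, port, url_path, timeout=10.0):
    """Probe http://host:port/url_path and return (status, message).

    Hypothetical stand-in for an external HTTP check, not the real plugin.
    """
    url = f"http://{host}:{port}{url_path}"
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return "OK", f"HTTP {resp.status} from {url}"
    except urllib.error.HTTPError as exc:
        # The service answered, but with a 4xx/5xx status.
        return "CRITICAL", f"HTTP {exc.code} from {url}"
    except OSError as exc:
        # Covers refused connections, timeouts and resets. A firewalled
        # port ends up here even when the service itself is healthy.
        reason = getattr(exc, "reason", exc)
        return "CRITICAL", f"connect to address {host} and port {port}: {reason}"
```

Calling check_http_port_url("208.80.154.147", 8081, "/") from another machine would take the OSError branch if the port is firewalled, whereas the same call made on the host itself (for example via NRPE) could still succeed. The "/" path here is a placeholder; the real healthcheck_url is not shown in the log.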
[05:06:15] (03PS1) 10Marostegui: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364365 (https://phabricator.wikimedia.org/T166204) [05:08:35] !log Deploy alter table on enwiki - labsdb1011 - T166204 [05:08:43] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364365 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [05:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:48] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [05:09:38] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364365 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [05:09:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364365 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [05:11:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 - T166204 (duration: 00m 43s) [05:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:11] !log Deploy alter table on db1066 - T166204 [05:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:21] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [05:16:09] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364366 [05:18:36] marostegui: morning! I got the green light to drop the huge _Edit table from dbstore1002 [05:19:45] Yaaaaaaaaaaay [05:24:35] dropped! [05:25:27] Does it exist on db1047? [05:25:33] Or was it only in dbstore1002? 
[05:26:12] only on dbstore1002 [05:26:19] and it was basically a copy of another table [05:46:03] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [06:42:13] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [06:52:12] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364366 [06:54:07] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364366 (owner: 10Marostegui) [06:54:59] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364366 (owner: 10Marostegui) [06:56:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1059 - T168661 (duration: 00m 41s) [06:56:10] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364366 (owner: 10Marostegui) [06:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:18] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [06:58:38] !log Deploy alter table on s1 - dbstore1002 - T166204 [06:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:48] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [06:58:52] elukey: ^ [07:03:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364371 (https://phabricator.wikimedia.org/T168661) [07:04:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364371 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [07:05:17] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/364371 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [07:06:08] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364371 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [07:06:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 - T168661 (duration: 00m 42s) [07:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:57] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [07:07:29] !log Deploy alter table db1084 - T168661 [07:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:20] (03PS1) 10Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364372 (https://phabricator.wikimedia.org/T153743) [07:13:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364372 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [07:14:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364372 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [07:15:05] marostegui: ack! 
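The depool, alter table, repool cycle logged above leaves a regular trail of "Synchronized wmf-config/db-eqiad.php: ..." entries. As a rough sketch, those entries can be parsed into structured records to follow which host was depooled or repooled for which task. The regular expression and the SyncEntry type below are assumptions inferred only from the entries visible in this log, not from any scap or SAL specification.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from entries such as:
#   !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 - T166204 (duration: 00m 43s)
SYNC_RE = re.compile(
    r"!log (?P<user>[^\s@]+)@(?P<deploy_host>\S+) Synchronized (?P<path>\S+): "
    r"(?P<summary>.*?)"
    r"(?: - (?P<task>T\d+))?"
    r" \(duration: (?P<min>\d+)m (?P<sec>\d+)s\)"
)


@dataclass
class SyncEntry:
    user: str
    deploy_host: str
    path: str
    summary: str
    task: Optional[str]  # Phabricator task ID, if the entry names one
    duration_s: int


def parse_sync_entry(line: str) -> Optional[SyncEntry]:
    """Return a SyncEntry for a 'Synchronized' log line, or None otherwise."""
    m = SYNC_RE.search(line)
    if m is None:
        return None
    return SyncEntry(
        user=m.group("user"),
        deploy_host=m.group("deploy_host"),
        path=m.group("path"),
        summary=m.group("summary"),
        task=m.group("task"),
        duration_s=int(m.group("min")) * 60 + int(m.group("sec")),
    )
```

Entries without a task ID, such as the later "Increase db1079 weight" syncs, parse with task set to None, and unrelated !log lines (reboots, alter table notes) return None.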
[07:15:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T153743 (duration: 00m 41s) [07:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:39] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:16:09] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364372 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [07:16:32] (03PS1) 10Muehlenhoff: Make the creation of an Icinga check for service::uwsgi configurable [puppet] - 10https://gerrit.wikimedia.org/r/364373 [07:23:04] (03PS4) 10Elukey: role::piwik::server: add regular bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) [07:26:26] !log Stop MySQL on db1079 for maintenance - T153743 [07:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:37] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:29:37] (03CR) 10Alexandros Kosiaris: [C: 032] Make the creation of an Icinga check for service::uwsgi configurable [puppet] - 10https://gerrit.wikimedia.org/r/364373 (owner: 10Muehlenhoff) [07:30:13] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [07:30:33] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:31:13] (03CR) 10Marostegui: [C: 032] db1079.yaml: Specify ROW as binlog format [puppet] - 10https://gerrit.wikimedia.org/r/364247 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [07:31:18] (03PS2) 10Marostegui: db1079.yaml: Specify ROW as binlog format [puppet] - 10https://gerrit.wikimedia.org/r/364247 (https://phabricator.wikimedia.org/T153743) [07:31:23] RECOVERY - HHVM rendering on mw2134 is OK: 
HTTP OK: HTTP/1.1 200 OK - 74473 bytes in 0.379 second response time [07:33:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [07:34:44] !log bouncing icinga-wm (tcpircbot) on einsteinium to get back its primary nick [07:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:54] !log amending previous SAL, I meant ircecho ofc [07:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:25] !log Stop MySQL db1102 for maintenance - T153743 [07:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:36] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:39:20] (03CR) 10Muehlenhoff: "Icinga check fixed in https://gerrit.wikimedia.org/r/#/c/364373/" [puppet] - 10https://gerrit.wikimedia.org/r/364174 (https://phabricator.wikimedia.org/T169070) (owner: 10Muehlenhoff) [07:43:29] (03PS1) 10Muehlenhoff: Readd siddharth11 [puppet] - 10https://gerrit.wikimedia.org/r/364376 [07:47:02] 10Operations, 10DBA, 10Mail: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3423956 (10jcrespo) a:03herron So we need: db name, account name, grants needed, ips/dns of the origin of the connections. 
[07:50:19] (03CR) 10Muehlenhoff: [C: 032] Readd siddharth11 [puppet] - 10https://gerrit.wikimedia.org/r/364376 (owner: 10Muehlenhoff) [07:55:15] (03CR) 10Elukey: [C: 032] role::piwik::server: add regular bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [07:55:22] (03PS5) 10Elukey: role::piwik::server: add regular bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) [07:55:25] (03CR) 10Elukey: [V: 032 C: 032] role::piwik::server: add regular bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [07:57:55] !log Drop localisation_file_hash table from frwiki and jawiki (s6) - T119811 [07:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:06] T119811: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811 [08:04:41] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3423983 (10MoritzMuehlenhoff) >>! In T150160#3423143, @Volans wrote: > And indeed the diff is shown as empty after that and now they are a PASS: > ``` > cp4021.mgmt.ulsfo.wmnet: PASS > bast3002... [08:10:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [08:13:42] (03CR) 10Volans: [C: 04-1] "Seems there is a typo to me for the DNS check, see inline. 
Compiler results looks ok:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363295 (https://phabricator.wikimedia.org/T169321) (owner: 10Alexandros Kosiaris) [08:15:39] (03PS5) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/363295 [08:16:04] (03CR) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363295 (owner: 10Alexandros Kosiaris) [08:17:07] 10Operations, 10Deployment-Systems, 10Performance-Team, 10HHVM, and 2 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3424026 (10MoritzMuehlenhoff) >>! In T103886#3423647, @Krinkle wrote: > Looking at `enable_reusable_tc` a bit, i... [08:17:59] !log Drop localisation_file_hash table from dewiki (s5) - T119811 [08:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:10] T119811: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811 [08:18:26] (03PS1) 10Jcrespo: mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) [08:19:16] (03PS2) 10Jcrespo: mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) [08:19:19] (03CR) 10jerkins-bot: [V: 04-1] mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [08:20:14] (03CR) 10jerkins-bot: [V: 04-1] mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [08:24:06] (03CR) 10Volans: [C: 031] "LGTM, let's give it another try!" 
[puppet] - 10https://gerrit.wikimedia.org/r/363295 (owner: 10Alexandros Kosiaris) [08:24:58] !log disable puppet on einsteinium (icinga host) for merge of https://gerrit.wikimedia.org/r/#/c/363295/5 [08:25:09] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/7000/" [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [08:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:11] actually... no [08:25:18] !log disable puppet everywhere but on einsteinium (icinga host) for merge of https://gerrit.wikimedia.org/r/#/c/363295/5 [08:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:55] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364383 [08:28:13] (03PS3) 10Jcrespo: mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) [08:29:25] (03CR) 10Alexandros Kosiaris: [C: 032] monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/363295 (owner: 10Alexandros Kosiaris) [08:35:39] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#2792080 (10Peachey88) Model details for ms6 is: Sun Fire X4540 (from: https://wikitech.wikimedia.org/wiki/Obsolete:Ms6) [08:35:49] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364383 [08:36:52] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364383 (owner: 10Marostegui) [08:37:46] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364383 (owner: 10Marostegui) [08:37:55] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/364383 (owner: 10Marostegui) [08:38:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1084 - T168661 (duration: 00m 42s) [08:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:54] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [08:40:41] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364384 (https://phabricator.wikimedia.org/T168661) [08:41:29] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3424125 (10aaron) >>! In T164173#3420723, @daniel wrote: > @aaron another question: does Re... [08:41:50] (03PS2) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364384 (https://phabricator.wikimedia.org/T168661) [08:45:14] (03PS5) 10Giuseppe Lavagetto: Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546) [08:45:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364384 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:46:39] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364384 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:46:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364384 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [08:47:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 - T168661 (duration: 00m 42s) [08:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:00] 
T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [08:49:17] !log rebooting mw1169 for kernel update [08:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:04] !log Deploy alter table on s4 - db1081 - T168661 [08:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:11] (03PS4) 10Jcrespo: mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) [08:56:47] (03PS5) 10Jcrespo: mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) [09:00:44] (03PS6) 10Giuseppe Lavagetto: Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546) [09:01:35] (03PS1) 10Marostegui: sanitarium3.sysvinit: Add PATH for mysqld_multi [puppet] - 10https://gerrit.wikimedia.org/r/364386 [09:09:46] (03CR) 10Marostegui: [C: 032] sanitarium3.sysvinit: Add PATH for mysqld_multi [puppet] - 10https://gerrit.wikimedia.org/r/364386 (owner: 10Marostegui) [09:09:49] ntkenghjfhctnflkbbhggcfhkrgjjnnklbk [09:09:58] grrrrr [09:12:23] !log reboot sarin for kernel update [09:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:03] 10Operations, 10Performance-Team, 10User-Elukey, 10Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3424167 (10elukey) So as stated in https://phabricator.wikimedia.org/T163337#3421600 t... 
[09:30:47] 10Operations, 10Analytics-Kanban, 10DBA, 10Patch-For-Review, 10User-Elukey: Puppetize Piwik's Database and set up periodical backups - https://phabricator.wikimedia.org/T164073#3424171 (10elukey) I want to observe how the patch that I merged behaves during the next days before closing. [09:40:14] !log Stop slave s6 on db1102 for exporting its content - T153743 [09:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:25] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:41:05] !log installing tiff security updates [09:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:53] (03PS1) 10Marostegui: s7.hosts: db1102 now replicates s7 [software] - 10https://gerrit.wikimedia.org/r/364393 (https://phabricator.wikimedia.org/T153743) [09:45:47] (03CR) 10Marostegui: [C: 032] s7.hosts: db1102 now replicates s7 [software] - 10https://gerrit.wikimedia.org/r/364393 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:46:30] (03Merged) 10jenkins-bot: s7.hosts: db1102 now replicates s7 [software] - 10https://gerrit.wikimedia.org/r/364393 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:50:11] (03PS1) 10Marostegui: db-eqiad.php: Repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364395 (https://phabricator.wikimedia.org/T153743) [09:51:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364395 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:52:50] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364395 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:53:03] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364395 
(https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [09:54:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079 with low weight - T153743 (duration: 00m 42s) [09:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:12] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [09:54:59] (03PS1) 10Jcrespo: [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T169514) [09:55:57] (03CR) 10jerkins-bot: [V: 04-1] [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [10:04:54] (03PS1) 10Marostegui: db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364398 [10:06:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364398 (owner: 10Marostegui) [10:06:58] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364398 (owner: 10Marostegui) [10:07:07] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364398 (owner: 10Marostegui) [10:08:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1079 weight (duration: 00m 42s) [10:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:32] !log Drop table localisation_file_hash from commonswiki - T119811 [10:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:43] T119811: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811 [10:17:52] (03PS1) 10Marostegui: db-eqiad.php: Increase db1079 weight 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/364399 [10:19:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364399 (owner: 10Marostegui) [10:19:50] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364399 (owner: 10Marostegui) [10:19:59] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364399 (owner: 10Marostegui) [10:20:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1079 weight (duration: 00m 42s) [10:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:48] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364400 [10:24:11] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364400 [10:25:46] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364400 (owner: 10Marostegui) [10:26:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364400 (owner: 10Marostegui) [10:26:49] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364400 (owner: 10Marostegui) [10:27:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081 - T168661 (duration: 00m 42s) [10:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:48] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [10:34:37] (03PS1) 10Alexandros Kosiaris: monitoring::host: Rename the mgmt hash key [puppet] - 10https://gerrit.wikimedia.org/r/364402 (https://phabricator.wikimedia.org/T169321) [10:35:12] (03CR) 10Alexandros Kosiaris: 
[V: 032 C: 032] monitoring::host: Rename the mgmt hash key [puppet] - 10https://gerrit.wikimedia.org/r/364402 (https://phabricator.wikimedia.org/T169321) (owner: 10Alexandros Kosiaris) [10:36:49] !log bump BFD timer from 300 to 600 on the eqiad-codfw link for T170131 [10:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:00] T170131: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131 [10:37:54] !log enable puppet everywhere but on einsteinium (icinga host) for merge of https://gerrit.wikimedia.org/r/#/c/363295/5 [10:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:22] (03CR) 10Giuseppe Lavagetto: "I do not agree with @volans and @gehel, I really don't like linter-based comments, esp in tests. But I won't block this." [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [10:39:06] (03PS1) 10MarcoAurelio: High density logos for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364405 (https://phabricator.wikimedia.org/T170248) [10:39:08] (03PS1) 10Marostegui: db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364406 [10:39:25] PROBLEM - puppet last run on labtestpuppetmaster2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Service[apache2] [10:40:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364406 (owner: 10Marostegui) [10:41:51] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364406 (owner: 10Marostegui) [10:42:10] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1079 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364406 (owner: 10Marostegui) [10:43:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1079 weight (duration: 00m 42s) [10:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:03] (03PS1) 10Muehlenhoff: Remove apt pinning for backports for ffmpeg [puppet] - 10https://gerrit.wikimedia.org/r/364408 [10:52:23] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [10:52:46] <_joe_> akosiaris: ^^ [10:52:55] (03PS1) 10Marostegui: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364409 (https://phabricator.wikimedia.org/T166204) [10:53:59] _joe_: yeah more or less expected. 
it's a race between the service resource getting exported but not the host resource yet and icinga populating its config [10:54:07] it should coalesce on the next run [10:54:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364409 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [10:55:10] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364409 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [10:56:08] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364409 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [10:56:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 with 0 weight - T166204 (duration: 00m 41s) [10:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:24] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [10:58:13] (03PS1) 10Muehlenhoff: Install libav-tools to provide avconv compatibility wrapper [puppet] - 10https://gerrit.wikimedia.org/r/364410 [11:01:20] (03PS2) 10Jcrespo: [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T169514) [11:02:18] (03CR) 10jerkins-bot: [V: 04-1] [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [11:05:17] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: merge routers/switches monitoring groups [puppet] - 10https://gerrit.wikimedia.org/r/364206 (https://phabricator.wikimedia.org/T167279) (owner: 10Faidon Liambotis) [11:08:11] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: move RIPE Atlas measurements under netops [puppet] - 
10https://gerrit.wikimedia.org/r/364208 (owner: 10Faidon Liambotis) [11:09:30] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: move RIPE Atlas host monitoring under netops [puppet] - 10https://gerrit.wikimedia.org/r/364207 (https://phabricator.wikimedia.org/T167279) (owner: 10Faidon Liambotis) [11:10:25] (03PS3) 10Jcrespo: [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T169514) [11:12:31] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [11:12:50] PROBLEM - Host conf1003.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [11:13:01] PROBLEM - Host db1063.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [11:13:20] PROBLEM - Host kafka1018.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [11:13:20] PROBLEM - Host kafka1020.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [11:13:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [11:14:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [11:15:12] (03PS1) 10Alexandros Kosiaris: Fix monitoring::host RSpec suite tests [puppet] - 10https://gerrit.wikimedia.org/r/364411 [11:17:15] jouncebot: next [11:17:15] In 1 hour(s) and 42 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170711T1300) [11:17:22] (03PS1) 10Elukey: role::mariadb::analytics::custom_repl_slave: add EventLogging cleaner user [puppet] - 10https://gerrit.wikimedia.org/r/364412 (https://phabricator.wikimedia.org/T170118) [11:18:12] (03CR) 10jerkins-bot: [V: 04-1] role::mariadb::analytics::custom_repl_slave: add EventLogging cleaner user [puppet] - 10https://gerrit.wikimedia.org/r/364412 (https://phabricator.wikimedia.org/T170118) (owner: 
10Elukey) [11:20:52] (03CR) 10Jcrespo: "unix socket authenticaion is missing on the server, otherwise we will create a passwordless user accessible by anyone." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364412 (https://phabricator.wikimedia.org/T170118) (owner: 10Elukey) [11:21:56] jynus: I sent git review too soon sorry, was amending :) [11:22:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:23:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:23:31] (03CR) 10Muehlenhoff: [C: 032] Remove apt pinning for backports for ffmpeg [puppet] - 10https://gerrit.wikimedia.org/r/364408 (owner: 10Muehlenhoff) [11:25:42] (03PS2) 10Elukey: role::mariadb::analytics::custom_repl_slave: add EventLogging cleaner user [puppet] - 10https://gerrit.wikimedia.org/r/364412 (https://phabricator.wikimedia.org/T170118) [11:25:53] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3424617 (10Volans) `db1053.mgmt.eqiad.wmnet` seems to work now, I can both ssh and get an chassis status from neodymium. Transient issue? 
[11:27:11] ACKNOWLEDGEMENT - Host conf1003.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% Volans Management interface not responding to ping: https://phabricator.wikimedia.org/T150160 [11:27:11] ACKNOWLEDGEMENT - Host db1063.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% Volans Management interface not responding to ping: https://phabricator.wikimedia.org/T150160 [11:27:11] ACKNOWLEDGEMENT - Host kafka1018.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% Volans Management interface not responding to ping: https://phabricator.wikimedia.org/T150160 [11:27:11] ACKNOWLEDGEMENT - Host kafka1020.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% Volans Management interface not responding to ping: https://phabricator.wikimedia.org/T150160 [11:27:19] (03CR) 10Muehlenhoff: [C: 032] Install libav-tools to provide avconv compatibility wrapper [puppet] - 10https://gerrit.wikimedia.org/r/364410 (owner: 10Muehlenhoff) [11:42:49] PROBLEM - Host sodium.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [11:43:58] (03PS2) 10Alexandros Kosiaris: Fix monitoring::host RSpec suite tests [puppet] - 10https://gerrit.wikimedia.org/r/364411 [11:44:00] (03PS1) 10Alexandros Kosiaris: monitoring: Check mgmt SSH availability as well [puppet] - 10https://gerrit.wikimedia.org/r/364415 (https://phabricator.wikimedia.org/T169321) [11:46:24] (03CR) 10Alexandros Kosiaris: [C: 032] Fix monitoring::host RSpec suite tests [puppet] - 10https://gerrit.wikimedia.org/r/364411 (owner: 10Alexandros Kosiaris) [11:46:30] (03CR) 10Alexandros Kosiaris: [C: 032] monitoring: Check mgmt SSH availability as well [puppet] - 10https://gerrit.wikimedia.org/r/364415 (https://phabricator.wikimedia.org/T169321) (owner: 10Alexandros Kosiaris) [11:47:22] (03PS26) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [11:49:04] (03PS1) 
10Muehlenhoff: Revert "Install libav-tools to provide avconv compatibility wrapper" [puppet] - 10https://gerrit.wikimedia.org/r/364417 [11:49:07] ACKNOWLEDGEMENT - Host sodium.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% Volans unresponsive BMC see https://phabricator.wikimedia.org/T169360 [11:50:02] (03CR) 10Elukey: "After a chat with Jaime I switched the default --my-cnf to /etc/my.cnf. Checked on db1047 and there is a [client] section with unix_socket" [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [11:51:14] (03PS2) 10Muehlenhoff: Revert "Install libav-tools to provide avconv compatibility wrapper" [puppet] - 10https://gerrit.wikimedia.org/r/364417 [11:55:36] (03CR) 10Muehlenhoff: [C: 032] Revert "Install libav-tools to provide avconv compatibility wrapper" [puppet] - 10https://gerrit.wikimedia.org/r/364417 (owner: 10Muehlenhoff) [12:13:25] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [12:14:05] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3570 bytes in 0.548 second response time [12:29:01] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#2775695 (10akosiaris) [12:29:05] 10Operations, 10DC-Ops, 10monitoring, 10Patch-For-Review: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321#3424728 (10akosiaris) 05Open>03Resolved And with the above merged, I think we can resolve this. Of course we have a nice number of actionables from this. e.g. https://... [12:34:08] (03CR) 10Giuseppe Lavagetto: [C: 031] Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:35:03] (03CR) 10Giuseppe Lavagetto: [C: 031] "I blame gerrit for showing me the patches out of order. 
Integration tests are indeed better with pytest, that makes switching everything r" [software/cumin] - 10https://gerrit.wikimedia.org/r/361274 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:41:29] PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld [12:42:05] PROBLEM - mysqld processes on dbstore2002 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld [12:42:23] (03PS5) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [12:42:45] <_joe_> jynus, marostegui are you doing backups? [12:43:11] see my comments on other channel [12:43:49] or just look at icinga [12:45:29] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3424784 (10jcrespo) p:05High>03Unbreak! Please challenge my consideration of this being an unbreak now as pages are being randomly sent at... 
[12:46:19] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3424790 (10jcrespo) CC @mark @faidon [12:52:49] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM in general, some small coding convention comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/364267 (owner: 10Andrew Bogott) [12:53:53] (03Draft1) 10Paladox: phabricator/varnish: Block /file/upload instead of /file/data for WP0 users [puppet] - 10https://gerrit.wikimedia.org/r/364424 [12:53:56] (03PS2) 10Paladox: phabricator/varnish: Block /file/upload instead of /file/data for WP0 users [puppet] - 10https://gerrit.wikimedia.org/r/364424 (https://phabricator.wikimedia.org/T170200) [12:57:09] (03PS3) 10Paladox: phabricator/varnish: Block /file/upload instead of /file/data for WP0 users [puppet] - 10https://gerrit.wikimedia.org/r/364424 (https://phabricator.wikimedia.org/T170200) [12:59:54] (03CR) 10Volans: [C: 04-2] "This would undo the block completely given that the abusers were already uploading content via non-WP0 connections." [puppet] - 10https://gerrit.wikimedia.org/r/364424 (https://phabricator.wikimedia.org/T170200) (owner: 10Paladox) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170711T1300). [13:00:04] phuedx and TabbyCat: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:14] o/ [13:00:51] 10Operations, 10MW-1.30-release-notes, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3345579 (10aude) I updated the sites table 2 weeks ago. 
interwiki links should work ok now. [13:00:57] (03CR) 10Giuseppe Lavagetto: [C: 031] Move configuration loader from cli to main module [software/cumin] - 10https://gerrit.wikimedia.org/r/363746 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans) [13:01:19] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3424844 (10jcrespo) if this wasn't clear, this happened at 12:41 UTC today again. [13:07:16] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3424861 (10jcrespo) Tracked dates: 1 May, 2 May, 26May, 6 Jun, 22 Jun, 7 Jul, 11 Jul Those are only the dates where this was identified- down... [13:08:07] (03PS6) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [13:09:00] (03CR) 10jerkins-bot: [V: 04-1] Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 (owner: 10Andrew Bogott) [13:09:15] (03CR) 10Giuseppe Lavagetto: [C: 031] Configuration: automatically load backend's aliases [software/cumin] - 10https://gerrit.wikimedia.org/r/363747 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans) [13:15:02] (03PS7) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [13:17:38] (03PS1) 10Ottomata: Prep for stat100[56] [puppet] - 10https://gerrit.wikimedia.org/r/364427 (https://phabricator.wikimedia.org/T152712) [13:17:56] okie poke [13:18:05] i guess we're not doing the european swat today [13:18:29] (03CR) 10jerkins-bot: [V: 04-1] Prep for stat100[56] [puppet] - 10https://gerrit.wikimedia.org/r/364427 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [13:19:17] (03PS2) 10Ottomata: Prep for stat100[56] [puppet] - 10https://gerrit.wikimedia.org/r/364427 
(https://phabricator.wikimedia.org/T152712) [13:19:51] Hello. [13:19:54] Yes I can SWAT. [13:20:36] (03PS1) 10Amire80: [WIP] Make compact language links default for all Wikipedias except en and de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 [13:21:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Make compact language links default for all Wikipedias except en and de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 (owner: 10Amire80) [13:22:39] (03PS1) 10Ottomata: Import cloudera jessie packages into a stretch wikimedia thirdparty component [puppet] - 10https://gerrit.wikimedia.org/r/364429 (https://phabricator.wikimedia.org/T152712) [13:22:48] (03PS8) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [13:23:40] (03PS2) 10Dereckson: High density logos for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364405 (https://phabricator.wikimedia.org/T170248) (owner: 10MarcoAurelio) [13:23:46] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364405 (https://phabricator.wikimedia.org/T170248) (owner: 10MarcoAurelio) [13:24:12] o/ Dereckson [13:24:48] (03Merged) 10jenkins-bot: High density logos for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364405 (https://phabricator.wikimedia.org/T170248) (owner: 10MarcoAurelio) [13:26:17] (03CR) 10jenkins-bot: High density logos for es.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364405 (https://phabricator.wikimedia.org/T170248) (owner: 10MarcoAurelio) [13:26:37] !log dereckson@tin Synchronized static/images/project-logos/: High density logos for es.wikibooks (T170248, 1/2) (duration: 00m 43s) [13:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:49] T170248: Optimize existing logo and add HD logos for es.wikibooks - https://phabricator.wikimedia.org/T170248 [13:27:38] Dereckson: i'd like to test my change on one of the 
mwdebug servers plz [13:27:39] (03PS2) 10Amire80: [WIP] Make compact language links default for all Wikipedias except en and de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 [13:29:16] phuedx: okay, I'll pull it to mwdebug1002 [13:29:22] (it's the standard procedure for SWAT by the way) [13:29:32] (03CR) 10Aklapper: [C: 04-1] "As long as you cannot check every single users on being "legit": No." [puppet] - 10https://gerrit.wikimedia.org/r/364424 (https://phabricator.wikimedia.org/T170200) (owner: 10Paladox) [13:29:35] ta [13:29:41] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: High density logos for es.wikibooks (T170248, 2/2) (duration: 00m 42s) [13:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:00] phuedx: we're waiting https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55-trusty/4752/console [13:32:11] (03CR) 10Ottomata: [C: 032] Import cloudera jessie packages into a stretch wikimedia thirdparty component [puppet] - 10https://gerrit.wikimedia.org/r/364429 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [13:32:19] phuedx: live on mwdebug1002.eqiad.wmnet [13:35:29] !log Purged https://en.wikipedia.org/static/images/project-logos/eswikibooks.png [13:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:54] (03PS7) 10Giuseppe Lavagetto: Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546) [13:37:04] (03PS6) 10Jcrespo: mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) [13:39:29] Dereckson: lgtm, i've verified that eventlogging is still working for page previews and for echo with the change applied [13:39:30] thanks [13:39:49] (03PS7) 10Jcrespo: mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 
(https://phabricator.wikimedia.org/T169514) [13:41:17] (03CR) 10Jcrespo: [C: 032] mariadb: monitor automatically any multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364381 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [13:42:02] phuedx: ok [13:44:12] !log dereckson@tin Synchronized php-1.30.0-wmf.7/extensions/EventLogging/modules/ext.eventLogging.subscriber.js: Don't subscribe EventLogging twice if window.onload fires twice (T170018) (duration: 00m 42s) [13:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:24] T170018: Duplicate events sent in Firefox after back button press - https://phabricator.wikimedia.org/T170018 [13:44:24] PROBLEM - puppet last run on db1096 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:45:59] (03PS2) 10Andrew Bogott: disable base monitoring for labtest* machines [puppet] - 10https://gerrit.wikimedia.org/r/364355 (owner: 10Dzahn) [13:47:53] (03PS1) 10Jcrespo: mariadb-multiinstance: avoid double declaration of monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/364432 (https://phabricator.wikimedia.org/T169514) [13:49:53] (03CR) 10Jcrespo: [C: 032] mariadb-multiinstance: avoid double declaration of monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/364432 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [13:50:34] (03CR) 10Andrew Bogott: [C: 032] disable base monitoring for labtest* machines [puppet] - 10https://gerrit.wikimedia.org/r/364355 (owner: 10Dzahn) [13:50:42] (03PS3) 10Andrew Bogott: disable base monitoring for labtest* machines [puppet] - 10https://gerrit.wikimedia.org/r/364355 (owner: 10Dzahn) [13:51:33] RECOVERY - puppet last run on db1096 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:52:38] 10Operations: Look into feasibility of disabling sha-1 host keys on our ssh daemons - https://phabricator.wikimedia.org/T167966#3425003 
(10MoritzMuehlenhoff) >>! In T167966#3422178, @ayounsi wrote: > Not sure if I'm hijacking the topic, but at least it's being tracked somewhere :) Partly :-) Could you please op... [13:57:46] PROBLEM - SSH restbase1018.mgmt.eqiad.wmnet on restbase1018.mgmt.eqiad.wmnet is CRITICAL: Server answer [13:58:15] akosiaris: ^^^ :D [13:58:36] (03PS1) 10Ottomata: Add cloudera-stretch to distributions-wikimedia for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364434 (https://phabricator.wikimedia.org/T152712) [13:59:09] volans: yay! [13:59:35] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [13:59:39] (03CR) 10Ottomata: [C: 032] Add cloudera-stretch to distributions-wikimedia for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364434 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [13:59:45] (03PS2) 10Ottomata: Add cloudera-stretch to distributions-wikimedia for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364434 (https://phabricator.wikimedia.org/T152712) [13:59:47] (03CR) 10Ottomata: [V: 032 C: 032] Add cloudera-stretch to distributions-wikimedia for stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364434 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [14:02:27] (03PS1) 10Ottomata: Remove typo comman in distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364436 [14:02:36] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:02:43] (03CR) 10Ottomata: [V: 032 C: 032] Remove typo comman in distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364436 (owner: 10Ottomata) [14:02:53] andrewbogott: ^ [14:09:27] (03PS1) 10Marostegui: db-eqiad.php: Restore db1079 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364437 [14:10:39] !log disabled icinga notifications for host and 
services for labsdb1004 [14:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:40] hola [14:11:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1079 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364437 (owner: 10Marostegui) [14:12:22] madhuvishy: could you also disable event handler if you don't downtime it please? [14:12:47] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1079 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364437 (owner: 10Marostegui) [14:13:00] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1079 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364437 (owner: 10Marostegui) [14:13:13] The raid handler might create a false positive task in phab if caught at the wrong time [14:13:24] volans: okay, done [14:13:46] !log Disable event handler icinga checks for labsdb1004 [14:13:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1079 original weight (duration: 00m 42s) [14:13:54] thanks! :) [14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:09] I am going to do a full upgrade before stopping services [14:14:16] anything against? [14:14:33] I have no idea how postgress will react [14:14:45] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [14:14:56] jynus: I'm not sure what Mariadb best practices are - but i'm slightly tentative about combining kernel upgrade with software upgrade [14:14:57] maybe stop postgres-upgrade-stopmysql-reboot? [14:15:15] well, I can upgrade mariadb only [14:15:21] and the kernel [14:15:38] kernel is updated already [14:15:53] yeah, when we reboot it's going to come up with the new kernel applied [14:16:13] who broke icinga config? 
[14:16:22] * volans looking [14:16:25] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:16:39] !log upgradem wmf-mariadb10 on labsdb1004 [14:16:45] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:16:48] ok, then [14:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:07] will not put both services down and reboot, ok [14:17:13] volans: probably https://gerrit.wikimedia.org/r/#/c/364355/ [14:17:19] Error: Could not find any host matching 'labtestvirt2003' (config file '/etc/icinga/puppet_services.cfg', starting on line 248027) [14:17:20] should I be the one on the serial console? [14:17:25] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:18:07] jynus: I can hop in if not [14:18:28] volans: that's weird, I think andrew set that up awhile ago https://phabricator.wikimedia.org/T166237#3309503 [14:18:32] ok, waiting for you to be ready and I will reboot [14:18:32] afaik not touched since [14:18:55] chasemp: it was merged 25m ago [14:19:10] gerrit /c/364355/ [14:20:11] it might just be the timing between puppet runs, checking [14:20:22] ah some global labtest monitoring exclusion volans, could this be transient w/ puppet weirdness? 
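The Icinga error quoted above ("Could not find any host matching 'labtestvirt2003'") is what happens when a service definition outlives its host definition. As a rough illustration only, and not the check Icinga itself runs, the following Python sketch scans Nagios/Icinga 1.x style object configuration and lists services whose host_name has no matching host block. The parser is deliberately simplified (it assumes flat "define TYPE { ... }" blocks with one directive per line), and the function names and the SAMPLE text are hypothetical, made up for this example.

```python
import re

# Matches one Nagios/Icinga 1.x object block, e.g. "define host { ... }".
# Simplifying assumption: blocks are flat and contain no nested braces.
BLOCK_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_objects(config_text):
    """Yield (object_type, {directive: value}) for each define block."""
    for obj_type, body in BLOCK_RE.findall(config_text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop inline ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        yield obj_type, directives


def orphaned_services(config_text):
    """Return sorted (host_name, service_description) pairs with no host block."""
    hosts = set()
    services = []
    for obj_type, directives in parse_objects(config_text):
        if obj_type == "host" and "host_name" in directives:
            hosts.add(directives["host_name"])
        elif obj_type == "service":
            services.append(directives)
    orphans = []
    for svc in services:
        # host_name may hold a comma-separated list of hosts.
        for name in svc.get("host_name", "").split(","):
            name = name.strip()
            if name and name not in hosts:
                orphans.append((name, svc.get("service_description", "")))
    return sorted(orphans)


# Hypothetical sample: one service still points at a host that is no longer defined.
SAMPLE = """
define host {
    host_name   labnet1001
}
define service {
    host_name            labnet1001
    service_description  puppet last run
}
define service {
    host_name            labtestvirt2003
    service_description  ntp
}
"""

if __name__ == "__main__":
    for host, description in orphaned_services(SAMPLE):
        print(f"service '{description}' references undefined host '{host}'")
```

Running a check like this against the generated configuration before reloading Icinga would flag the leftover service definitions; the real fix discussed in the channel is still to keep host and service definitions in step on the Puppet side.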
[14:20:24] jynus: okay I'm on the console [14:20:27] right [14:20:40] chasemp: but if puppet is not running on those hosts it will not be transient ;) [14:20:46] ok [14:21:23] !log rebooting labsdb1004 for kernel upgrade T168584 [14:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:34] T168584: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584 [14:21:49] andrewbogott: fyi https://gerrit.wikimedia.org/r/#/c/364355 has broken icinga at least for a bit [14:22:07] still broken, I'm running puppet now to see if it fixes [14:22:21] and it's removing definitions [14:22:27] so might just be that [14:22:32] hold on ;) [14:22:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364442 (https://phabricator.wikimedia.org/T168661) [14:22:34] both databases down [14:22:39] Looks like were down [14:22:39] now rebooting properly [14:22:42] kk [14:22:54] volans: is it happy now? [14:23:01] not yet [14:24:13] andrewbogott: check_ipmi_temp, check_nova_compute_process, kvm_ssl_cert, kvm_ssl_cert, ntp are still defined [14:24:13] madhuvishy: ping if you see something weird or when boot finishes [14:24:20] (it's booting up, seems okay so far) [14:24:22] yup [14:24:54] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2118.codfw.wmnet [14:24:58] probably I am making too much out of this,but I am mostly worried about 1001/3 [14:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:12] specially looking at their downtimes… [14:25:16] *uptimes [14:25:21] he he [14:25:31] volans: it's not really my patch, should I just revert it? Or do you think another cycle of running on hosts and monitoring host will do it? 
[14:25:33] i'm in [14:25:49] it's up [14:25:50] <_joe_> andrewbogott: yeah that patch is simplistic [14:25:53] Wikilabes is up [14:25:54] <_joe_> let's remove that [14:25:55] andrewbogott: it might need revert [14:25:56] *Wikilabels [14:26:04] because there are checks defined elsewhere [14:26:06] starting mysql [14:26:17] at assume that the host is defined [14:26:19] <_joe_> andrewbogott: revert it, the right way to do that is not that patch [14:26:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364442 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [14:26:28] (03PS1) 10Andrew Bogott: Revert "disable base monitoring for labtest* machines" [puppet] - 10https://gerrit.wikimedia.org/r/364444 [14:26:32] postgress is handled automatically [14:26:34] and seems up [14:26:38] halfak: please confirm [14:26:38] _joe_: add comments to https://gerrit.wikimedia.org/r/#/c/364444/ ? [14:26:46] !log installing apache security updates on mw2* [14:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:59] jynus, see my previous message :) [14:27:03] Wikilabels is back online [14:27:09] good [14:27:11] that is all [14:27:19] that is alli wil finish the mysql upgrade [14:27:20] Cool thanks folks! [14:27:22] \o/ [14:27:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364442 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [14:27:25] and it should catch up [14:27:27] soon [14:27:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364442 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [14:27:48] (03CR) 10Giuseppe Lavagetto: [C: 032] "The problem is that this removes the monitoring::host definition, but it doesn't disable all the services from being defined. 
That breaks " [puppet] - 10https://gerrit.wikimedia.org/r/364444 (owner: 10Andrew Bogott) [14:27:59] madhuvishy: I probably should tell you the details [14:28:02] <_joe_> andrewbogott: done [14:28:04] about mysql handling [14:28:06] jynus: yes please :) [14:28:14] even if hopefuly [14:28:19] you should never use it [14:28:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 - T168661 (duration: 00m 43s) [14:28:27] let's talk on databases- [14:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:34] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [14:28:35] okay [14:30:54] 10Operations, 10ops-eqiad, 10Services (watching): scb1003 unresponsive after reboot - https://phabricator.wikimedia.org/T168534#3425143 (10Cmjohnson) 05Open>03Resolved This is up and working...resolving [14:31:22] !log Deploy alter table on db1064 - commonswiki and let it replicate to db1095 and labsdb1009, 1010 and 1011 - T168661 [14:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:17] (03PS2) 10Andrew Bogott: Revert "disable base monitoring for labtest* machines" [puppet] - 10https://gerrit.wikimedia.org/r/364444 [14:41:10] volans: is that better? [14:41:59] andrewbogott: CR or icinga? [14:42:15] PROBLEM - Host restbase1018.mgmt.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:42:44] that's me ^ [14:42:58] volans: icinga [14:43:15] just the mgmt interface trying to reset it via the i-button [14:43:24] still broken, running puppet now [14:43:32] I thought you fixed it reverting :( [14:44:35] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:45:23] andrewbogott: did you merge the revert? 
[14:45:35] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:46:12] 10Operations, 10ops-eqiad: mgmt inaccessible on restbase1018 - https://phabricator.wikimedia.org/T169871#3425211 (10Cmjohnson) @MoritzMuehlenhoff The mgmt interface is frozen and the server will need to be powered off and unplugged for several secs to reset the interface. I attempted to try to reset it using... [14:46:26] volans: I think so… ^^ is evidence, right? [14:47:32] and did you run puppet on labtestvirt2003? [14:47:41] I've run it now [14:47:59] and it added monitoring stuff [14:48:22] running back it on einstenium [14:52:30] andrewbogott: so, how most of our checks are puppetized is: puppet run on the target hosts and export some resources with the checks to be performed, then when puppet runs on the icinga host it will collect those exported resources and creates the configuration for those in icinga [14:52:59] that's the 1-line TL;DR, so if you change add/remove stuff, you need first to run puppet on the target hosts and then on the icinga host [14:53:20] or wait 40 minutes :) [14:53:30] with icinga broken in the middle, so no [14:53:40] (03PS9) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [14:54:01] I'm forcing a run of puppet on all labtest* that don't have it disabled [14:54:04] to fix it [14:54:09] ok, thank you [14:55:18] also you would need to wait up to 1h, given that you need puppet to run on all target hosts and then on icinga [14:55:55] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3425238 (10madhuvishy) [14:58:09] icinga, back fully functional [14:58:59] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3425254 (10madhuvishy) Status: 
labsdb1005 reboot is scheduled for July 12 at 1400 UTC. We've decided to wait on labsdb1001 and 1003 reboots for now - given t... [14:59:36] <_joe_> well icinga is not broken, but still having a broken config (which means icinga won't startup in case it crashes) is unacceptable. [15:00:53] <_joe_> and no, puppet working doesn't mean icinga is unbroken [15:01:09] <_joe_> you need to check icinga to ensure that [15:04:21] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3425277 (10Papaul) @RobH can you please setup network port for netmon2001? Thanks asw-d-codfw:ge-5/0/23 [15:04:47] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [15:06:27] 10Operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#3425280 (10BBlack) >>! In T124954#3423643, @Krinkle wrote: >>>! In T124954#3421257, @BBlack wrote: >> [..] We don't believe it should be possible at this time for an object to exist in the caching layers f... 
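[Editor's note] The export-then-collect ordering volans describes at 14:52, and the failure _joe_ points out in his 14:27 review comment (the host definition was removed while services for that host were still defined), can be illustrated with a toy model. This is only a sketch under stated assumptions: `ExportStore`, `run_on_target`, `run_on_icinga` and `config_errors` are hypothetical names invented for illustration and are not Puppet or Icinga APIs; only the host and check names are taken from the log.

```python
# Toy model of exported resources feeding a monitoring config.
# Every class and function name here is hypothetical (illustration only).

class ExportStore:
    """Stands in for the shared store where target hosts export their checks."""

    def __init__(self):
        self.hosts = {}     # host name -> True if a host definition is exported
        self.services = {}  # host name -> list of exported service checks

    def run_on_target(self, host, define_host, checks):
        """A puppet run on a target host replaces what that host exports."""
        self.hosts[host] = define_host
        self.services[host] = list(checks)


def run_on_icinga(store):
    """A puppet run on the monitoring host collects whatever is exported now."""
    hosts = {h for h, defined in store.hosts.items() if defined}
    services = [(h, c) for h, checks in store.services.items() for c in checks]
    return hosts, services


def config_errors(hosts, services):
    """A service that references an undefined host makes the config invalid."""
    return [
        f"Could not find any host matching '{h}' (service {c})"
        for h, c in services
        if h not in hosts
    ]


store = ExportStore()
checks = ["ntp", "kvm_ssl_cert"]

# Normal state: host definition and services are both exported.
store.run_on_target("labtestvirt2003", True, checks)
ok = config_errors(*run_on_icinga(store))

# The patch drops the host definition but services are still exported.
store.run_on_target("labtestvirt2003", False, checks)
broken = config_errors(*run_on_icinga(store))

# Revert merged, but only the monitoring host runs puppet: nothing changes.
still_broken = config_errors(*run_on_icinga(store))

# Revert applied on the target host first, then collected: config is valid.
store.run_on_target("labtestvirt2003", True, checks)
fixed = config_errors(*run_on_icinga(store))
```

In this model, rerunning the collection step alone never repairs the config, which matches the advice in the log to run puppet on the target hosts first and on the icinga host afterwards.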
[15:07:35] (03PS1) 10Alexandros Kosiaris: Remove filtertags from kubernetes roles [puppet] - 10https://gerrit.wikimedia.org/r/364449 [15:07:37] (03PS1) 10Alexandros Kosiaris: package_builder: Conditionalize dh-php inclusion [puppet] - 10https://gerrit.wikimedia.org/r/364450 [15:08:07] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove filtertags from kubernetes roles [puppet] - 10https://gerrit.wikimedia.org/r/364449 (owner: 10Alexandros Kosiaris) [15:08:47] RECOVERY - Host restbase1018.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms [15:09:10] 10Operations, 10Analytics, 10Analytics-Cluster: Clean up permissions for privatedata files on stat1002 - they should be group readable by statistics-privatedata-users - https://phabricator.wikimedia.org/T89887#3425286 (10elukey) [15:11:45] (03PS2) 10Giuseppe Lavagetto: recommendation api: refactor profile, remove module [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) [15:11:47] (03PS1) 10Giuseppe Lavagetto: role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) [15:11:58] <_joe_> mobrovac: ^^ [15:12:09] (03PS1) 10Alexandros Kosiaris: Add docker::registry::username [labs/private] - 10https://gerrit.wikimedia.org/r/364452 [15:12:19] <_joe_> still need some work, but should be ok [15:12:24] cool [15:12:33] (03CR) 10Giuseppe Lavagetto: [C: 031] Add docker::registry::username [labs/private] - 10https://gerrit.wikimedia.org/r/364452 (owner: 10Alexandros Kosiaris) [15:12:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add docker::registry::username [labs/private] - 10https://gerrit.wikimedia.org/r/364452 (owner: 10Alexandros Kosiaris) [15:13:30] (03CR) 10jerkins-bot: [V: 04-1] role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [15:13:38] RECOVERY - 
mediawiki-installation DSH group on mw2118 is OK: OK [15:14:21] (03CR) 10Mobrovac: [C: 04-1] role::scb: add recommendation-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [15:14:23] (03CR) 10Marostegui: [C: 031] "Remember this needs manually applying on the desired hosts." [puppet] - 10https://gerrit.wikimedia.org/r/364412 (https://phabricator.wikimedia.org/T170118) (owner: 10Elukey) [15:15:10] (03CR) 10Volans: [C: 032] Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:15:50] (03Merged) 10jenkins-bot: Fix Pylint and other tools reported errors [software/cumin] - 10https://gerrit.wikimedia.org/r/361040 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:16:24] (03CR) 10Volans: [C: 032] Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:17:00] jynus: this still looks downtimed - https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=labsdb1004 (Jcrespo rebooting for kernel upgrade). okay if I remove? 
[15:18:20] (03Merged) 10jenkins-bot: Package metadata and testing tools improvements [software/cumin] - 10https://gerrit.wikimedia.org/r/338808 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:18:50] madhuvishy: it is ok, it will go away at 16h on its own [15:18:57] jynus: okay :) [15:19:05] or less, if icinga fails again [15:19:18] jynus: love your positiveness :-P [15:19:41] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3425367 (10Ottomata) [15:21:54] !log rebooting uranium for kernel update [15:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:56] !log restart burrow on krypton [15:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:06] !log Stop replication labsdb1009 for maintenance - T153743 [15:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:16] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [15:26:23] (03PS1) 10Alexandros Kosiaris: Add docker::registry::password [labs/private] - 10https://gerrit.wikimedia.org/r/364454 [15:29:31] !log Stop replication labsdb1010 for maintenance - T153743 [15:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:27] (03CR) 10Volans: [C: 032] Tests: convert unittest to pytest [software/cumin] - 10https://gerrit.wikimedia.org/r/361274 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:32:53] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add docker::registry::password [labs/private] - 10https://gerrit.wikimedia.org/r/364454 (owner: 10Alexandros Kosiaris) [15:34:48] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425420 (10Cmjohnson) @jcrespo: the issue should be resolved. 
The cable was in the wrong eth port. Confirmed MAC cmjohnson@asw-b-eqiad> ... ethernet-switching table brief |grep ge-5/0/5... [15:35:10] (03Merged) 10jenkins-bot: Tests: convert unittest to pytest [software/cumin] - 10https://gerrit.wikimedia.org/r/361274 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:35:37] (03CR) 10Volans: [C: 032] TODO: remove rejected item [software/cumin] - 10https://gerrit.wikimedia.org/r/361638 (owner: 10Volans) [15:36:26] !log rolling restart of thumbor to pick up tiff and expat security updates [15:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:15] (03Merged) 10jenkins-bot: TODO: remove rejected item [software/cumin] - 10https://gerrit.wikimedia.org/r/361638 (owner: 10Volans) [15:39:05] (03CR) 10Volans: [C: 032] Move configuration loader from cli to main module [software/cumin] - 10https://gerrit.wikimedia.org/r/363746 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans) [15:39:10] (03PS1) 10Jcrespo: install_server: Reimage db1096 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/364456 (https://phabricator.wikimedia.org/T169514) [15:39:48] (03PS1) 10Giuseppe Lavagetto: Add entries for service recommendation-api [dns] - 10https://gerrit.wikimedia.org/r/364457 (https://phabricator.wikimedia.org/T165760) [15:39:50] (03PS1) 10Giuseppe Lavagetto: Add discovery DNS entry for service recommendation-api [dns] - 10https://gerrit.wikimedia.org/r/364458 (https://phabricator.wikimedia.org/T165760) [15:40:59] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425465 (10jcrespo) May I ask you to check db1100, db1104 and db1105- probably the same issue. 
[15:41:11] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Conditionalize dh-php inclusion [puppet] - 10https://gerrit.wikimedia.org/r/364450 (owner: 10Alexandros Kosiaris) [15:41:34] (03CR) 10Mobrovac: Add entries for service recommendation-api (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/364457 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [15:41:36] (03Merged) 10jenkins-bot: Move configuration loader from cli to main module [software/cumin] - 10https://gerrit.wikimedia.org/r/363746 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans) [15:41:59] <_joe_> mobrovac: I didn't like to have "api" in the name [15:42:14] <_joe_> it sounds sooo redundant [15:42:20] you and me both _joe_, but that's the name they went with ... [15:42:26] <_joe_> w/e? [15:42:52] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425480 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1098.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-re... [15:42:56] (03CR) 10Volans: [C: 032] Configuration: automatically load backend's aliases [software/cumin] - 10https://gerrit.wikimedia.org/r/363747 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans) [15:43:08] <_joe_> but, I just noticed my question wasn't answered even yesterday (https://phabricator.wikimedia.org/T148129#3415882) [15:43:20] <_joe_> so we'll wait for an answer to that. 
[15:43:52] (03PS2) 10Jcrespo: install_server: Reimage db1096 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/364456 (https://phabricator.wikimedia.org/T169514) [15:45:02] (03Merged) 10jenkins-bot: Configuration: automatically load backend's aliases [software/cumin] - 10https://gerrit.wikimedia.org/r/363747 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans) [15:45:20] (03CR) 10Jcrespo: [C: 032] install_server: Reimage db1096 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/364456 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:49:13] (03PS1) 10DCausse: [WIP] Bump ltr plugin to include logging features [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/364462 [15:49:38] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2017696 [15:50:36] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425524 (10jcrespo) [15:56:00] !log restarting elastic on relforge100*.eqiad.wmnet to pickup a new version of the ltr plugin [15:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:58] RECOVERY - SSH restbase1018.mgmt.eqiad.wmnet on restbase1018.mgmt.eqiad.wmnet is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170711T1600). Please do the needful. [16:08:01] 10Operations, 10ops-codfw, 10Cloud-Services, 10Cloud-VPS, 10netops: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3425572 (10RobH) a:05RobH>03Papaul Assigned to Papaul to try another NIC on the server, and open a support case for a bad nic if so. 
[16:09:31] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3425576 (10RobH) 05Open>03Resolved ``` robh@asw-d-codfw# show | compare [edit interfaces interface-range vlan-public1-d-codfw] member ge-1/0/13 { ... } + memb... [16:12:54] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3425597 (10RobH) If the old certificate was not compromised, it is a lot cleaner to simply let it expire. Revokcation, as I understanding... [16:13:55] 10Operations, 10ops-codfw, 10Cloud-Services, 10Cloud-VPS, 10netops: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3425612 (10Papaul) @Robh this is already done it was not switch problem it was DNS see T167157 [16:14:23] 10Operations, 10ops-codfw, 10Cloud-Services, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3425615 (10RobH) [16:14:26] 10Operations, 10ops-codfw, 10Cloud-Services, 10Cloud-VPS, 10netops: codfw: labtestpuppetmaster2001 switch port configuration - https://phabricator.wikimedia.org/T167321#3425614 (10RobH) 05Open>03Resolved [16:14:34] holy crap that ticket links to a lot of other crap [16:15:26] (03PS1) 10RobH: remove eventdonations.w.o from dns [dns] - 10https://gerrit.wikimedia.org/r/364464 [16:15:28] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3425617 (10BBlack) I think in this case we should revoke unless the expiry is already very close (it might be!). This is private key that... 
[16:15:51] bblack: heh, I was about to write the same :) [16:15:56] so yeah, +1 [16:16:14] Not After : Sep 4 12:10:02 2017 GMT [16:16:17] fwiw [16:16:46] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3425630 (10BBlack) Ah I missed the part above where it stated that it expired in a week or two. In that case, there's little point for this... [16:16:55] !log Removed 2FA for Arsog1985 SUL account (T168779) [16:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:06] T168779: Disable Two-factor authentication for user Arsog1985 (hywiki) - https://phabricator.wikimedia.org/T168779 [16:17:36] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3422468 (10faidon) Looks like it expires in September: ``` Validity Not Before: Jul 18 18:16:03 2016 GMT No... [16:18:18] sorry Dereckson :) [16:18:19] (03CR) 10Jgreen: [C: 031] remove eventdonations.w.o from dns [dns] - 10https://gerrit.wikimedia.org/r/364464 (owner: 10RobH) [16:18:36] (03CR) 10Umherirrender: Add ar_content_format and ar_content_model to labs views (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: 10Umherirrender) [16:19:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:19:20] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3425657 (10RobH) I only advised against revokcation since that was my understanding from @bblack, I'm not trying to block this. In fact, I... 
[16:20:37] tzatziki: thanks to have logged the operation [16:21:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [16:21:38] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:21:59] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3425669 (10RobH) a:03RobH Chatted in irc, I'll revoke this shortly. [16:22:26] (03CR) 10RobH: [C: 032] remove eventdonations.w.o from dns [dns] - 10https://gerrit.wikimedia.org/r/364464 (owner: 10RobH) [16:22:59] (03PS1) 10Jcrespo: mariadb: Fix default package installation for stretch [puppet] - 10https://gerrit.wikimedia.org/r/364466 (https://phabricator.wikimedia.org/T168356) [16:23:54] 10Operations, 10Traffic: revoke benefactorevents.wikimedia.org SSL certificate - https://phabricator.wikimedia.org/T170140#3425704 (10Jgreen) [16:23:56] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: remove icinga monitoring for benefactorevents.wm.o SSL certificate - https://phabricator.wikimedia.org/T170139#3425705 (10Jgreen) [16:24:16] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Fix default package installation for stretch [puppet] - 10https://gerrit.wikimedia.org/r/364466 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [16:24:22] 10Operations, 10Traffic: revoke benefactorevents.wikimedia.org SSL certificate - https://phabricator.wikimedia.org/T170140#3425709 (10RobH) a:03RobH [16:24:52] (03PS1) 10Ottomata: Use pulls rather than updates to pull cloudera jessie packages into stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364467 (https://phabricator.wikimedia.org/T152712) [16:26:37] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster 
termination - https://phabricator.wikimedia.org/T132521#3425715 (10BBlack) [16:27:38] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:28:01] (03PS2) 10Jcrespo: mariadb: Fix default package installation for stretch [puppet] - 10https://gerrit.wikimedia.org/r/364466 (https://phabricator.wikimedia.org/T168356) [16:28:19] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops, 10Patch-For-Review: remove eventdonations.wikimedia.org CNAME - https://phabricator.wikimedia.org/T170192#3425724 (10RobH) [16:28:23] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3425722 (10RobH) 05Open>03stalled Certificate Status: Revoke Processing on Globalsign's systems. I'm going to move this to stalled, a... [16:28:40] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3425727 (10RobH) p:05Triage>03Normal [16:29:23] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Implement a pod networking policy approach - https://phabricator.wikimedia.org/T170111#3425734 (10mobrovac) [16:29:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:30:03] (03PS1) 10RobH: remove eventdonations.w.o cert from repo [puppet] - 10https://gerrit.wikimedia.org/r/364468 [16:30:51] (03CR) 10RobH: [C: 032] remove eventdonations.w.o cert from repo [puppet] - 10https://gerrit.wikimedia.org/r/364468 (owner: 10RobH) [16:30:53] 10Operations: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#1423684 (10Dereckson) I've prepared a change to remove users from `deployment` group from the `restricted` group, that will help to get a more accurate list to revisit. ```name=Data check,lang=python $ python >>> import yaml >... 
[16:31:05] (03PS1) 10Dereckson: Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) [16:31:28] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:31:58] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425748 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1098.eqiad.wmnet'] ``` and were **ALL** successful. [16:32:00] !log restarting varnish backend on cp1074 (mailbox lag) [16:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:29] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops, 10Patch-For-Review: remove eventdonations.wikimedia.org CNAME - https://phabricator.wikimedia.org/T170192#3425768 (10RobH) 05Open>03Resolved a:03RobH dns removed [16:34:23] 10Operations, 10Patch-For-Review: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#3425787 (10Dereckson) There is no dupe for ops/restricted by the way. ```lang=python >>> set(d['groups']['restricted']['members']) & set(d['groups']['ops']['members']) set([]) ``` [16:34:48] PROBLEM - Host kafka2003 is DOWN: PING CRITICAL - Packet loss = 100% [16:36:14] RECOVERY - Host kafka2003 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [16:36:38] (03CR) 10Brian Wolff: [C: 04-1] "Ive been trying to argue that it should also be redacted for revision in T169097. (For revision table mw redacts when stuff is revdeleted)" [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: 10Umherirrender) [16:38:08] hi ops! [16:38:37] net-ops in particualr: any chance this firewall update will be ready tomorrow? 
[16:38:45] https://phabricator.wikimedia.org/T170007 [16:39:05] huh [16:39:08] a phab task i cannot view [16:39:11] fundraising-not-tech is planning a pre-test for the big english countries [16:39:11] interesting... [16:39:30] weird... it says 'custom policy' in the visibility [16:39:32] ejegg: i would review and try to help find someone see it [16:39:34] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [16:39:35] but i dont have access [16:39:50] i'd set all of the ops stuff to #acl ops team viewable [16:40:21] i just went to open and try to hunt down help for you since im on clinic duty [16:40:23] =] [16:41:06] i also cannot see the view policy, or someone could socially engineer their way into it. basically you as owner can though, but your call [16:41:16] well huh, clicking 'other projects' in the visibility dropdown does nothing for me [16:41:22] edit the task [16:41:31] and then in the visible to drop down, its likely set to custom [16:41:41] Jeff_Green: can you change the visibility of https://phabricator.wikimedia.org/T170007 to ops? [16:41:45] add acl*ops-team [16:41:54] eyah i dunno who is owner either ;] [16:42:10] ahhh, there's the popup now. just took a whole minute to arrive [16:42:12] sorry, #acl*operations-team [16:42:16] not ops-team [16:42:28] it's it just nda now? looking... [16:42:39] its something odd, cuz i have nda and i cannto see it ;] [16:42:43] iirc i had nda [16:42:47] try now [16:42:54] now i can view [16:42:56] =] [16:43:01] i think it picked up the permissions of the parent task. sorry/thanks [16:43:29] no worries, of course, I cannot apply these updates to the frack firewalls, but i can poke others ;] [16:43:36] thanks robh! 
[16:43:42] though it being tuesday night for the netops [16:43:58] if icannot find one today they'll have an email waitinf or them first thign when they start [16:44:01] asking one of them to poke this [16:44:09] (03PS3) 10Jcrespo: mariadb: Fix default package installation for stretch [puppet] - 10https://gerrit.wikimedia.org/r/364466 (https://phabricator.wikimedia.org/T168356) [16:44:10] that would be great! [16:44:21] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/7012/" [puppet] - 10https://gerrit.wikimedia.org/r/364466 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo) [16:44:36] the banners team is hoping to do the test that needs this at 15:00 UTC tomorrow [16:47:08] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops, 10Patch-For-Review: remove eventdonations.wikimedia.org CNAME - https://phabricator.wikimedia.org/T170192#3425918 (10RobH) [16:47:11] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops, 10Patch-For-Review: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3425916 (10RobH) 05stalled>03Resolved Revocation Request Completed for eventdonations.wikimedia.org [16:47:42] ejegg: So yeah I just tried to hunt down the primary netop but he is offline since its evening in his timezone [16:48:11] i left him a PM, and will still email the three of them who do most of the netowrk changes on pfw [16:48:25] gotcha [16:48:33] thanks muchly for reaching out! [16:48:47] I don't think they knew it existed before the perms changed though [16:48:55] since it was set to wmf fr before and no netops belong to that [16:49:00] afaik [16:49:02] yeah, that would explain the lack of response! [16:49:20] We are not intentionally ignoring you, I promise! 
Though it stinks for the deadlie [16:49:29] deadline, i think they should be able to patch it in their EU AM without issue [16:49:43] (that isn't a promise though, i cannot speak to their schedules) [16:49:45] (03PS2) 10Umherirrender: Add ar_content_format and ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) [16:50:05] yeah, looks like we're burning one of our last-minute-request credits :) [16:50:42] ejegg: how dare you raise money to pay for things [16:50:52] like us [16:50:57] and our paychecks ;] [16:53:00] just one piece of the puzzle! we like to play nice too [16:54:06] (03CR) 10Dzahn: [C: 031] "removes the following user: hoo khorn ssastry nuria legoktm addshore confirmed they are all already in deployment group. lgtm, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) (owner: 10Dereckson) [16:54:46] (03CR) 10Addshore: [C: 031] Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) (owner: 10Dereckson) [16:55:26] (03PS2) 10Dereckson: Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) [16:55:30] (03PS1) 10RobH: remove benefactorevents.w.o cert as its been revoked [puppet] - 10https://gerrit.wikimedia.org/r/364472 [16:55:32] 10Operations, 10Traffic: revoke benefactorevents.wikimedia.org SSL certificate - https://phabricator.wikimedia.org/T170140#3425957 (10RobH) Please note benefactorevents.wikimedia.org doesn't expire until 04/02/2018. Since this private key is accessible by a third party (Trilogy), I'm revoking the certificate... 
[16:57:34] (03PS1) 10RobH: remove benefactorevents.w.o from dns [dns] - 10https://gerrit.wikimedia.org/r/364473 [16:58:07] (03CR) 10RobH: [C: 032] remove benefactorevents.w.o cert as its been revoked [puppet] - 10https://gerrit.wikimedia.org/r/364472 (owner: 10RobH) [16:59:02] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3425966 (10jcrespo) [16:59:11] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156297 (10jcrespo) [16:59:19] (03CR) 10RobH: [C: 032] remove benefactorevents.w.o from dns [dns] - 10https://gerrit.wikimedia.org/r/364473 (owner: 10RobH) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170711T1700). [17:03:24] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3425988 (10RobH) [17:03:26] 10Operations, 10Traffic, 10Patch-For-Review: revoke benefactorevents.wikimedia.org SSL certificate - https://phabricator.wikimedia.org/T170140#3425984 (10RobH) 05Open>03Resolved Revocation Request Completed for benefactorevents.wikimedia.org - confirmed from globalsign. I've gone ahead and removed the k... [17:03:38] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: remove icinga monitoring for benefactorevents.wm.o SSL certificate - https://phabricator.wikimedia.org/T170139#3425989 (10Dzahn) 05Open>03Resolved a:03Dzahn It's gone from Icinga now.
[17:04:24] 10Operations, 10DNS, 10Traffic, 10fundraising-tech-ops: remove benefactorevents.wikimedia.org cname from DNS - https://phabricator.wikimedia.org/T170295#3425995 (10Jgreen) [17:05:27] 10Operations, 10DNS, 10Traffic, 10fundraising-tech-ops: remove benefactorevents.wikimedia.org cname from DNS - https://phabricator.wikimedia.org/T170295#3426010 (10Jgreen) 05Open>03Resolved p:05Triage>03Normal a:03RobH [17:07:08] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3426023 (10Dzahn) 05Resolved>03Open Thanks Papaul, Rob, i'm gonna reopen this and take it to continue with OS install and adding services. [17:07:22] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3426026 (10Dzahn) a:05Papaul>03Dzahn [17:10:29] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3426043 (10Dzahn) a:05Dzahn>03Papaul [17:10:48] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. 
- https://phabricator.wikimedia.org/T169937#3426048 (10Pchelolo) [17:10:50] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#3426051 (10BBlack) [17:10:53] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3426050 (10BBlack) 05Open>03Resolved [17:11:52] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201391 (10BBlack) [17:11:55] 10Operations, 10Traffic, 10HTTPS, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3426063 (10BBlack) [17:11:58] 10Operations, 10Traffic, 10Wikimedia-Shop, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3426061 (10BBlack) [17:12:01] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3426064 (10RobH) I did not mean to resolve the task, my bad! [17:12:27] 10Operations, 10Traffic, 10HTTPS, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423897 (10BBlack) [17:12:30] 10Operations, 10Traffic, 10HTTPS: Enable HSTS on Wikimedia sites - https://phabricator.wikimedia.org/T40516#3426075 (10BBlack) [17:12:34] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201391 (10BBlack) 05Open>03Resolved a:03BBlack Resolving this and moving the last remaini... 
[17:12:38] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3426076 (10Dzahn) Do you know which partman recipe is the right one? [17:12:44] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3426077 (10Papaul) Racking complete [17:14:32] 10Operations, 10Traffic, 10HTTPS, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3426099 (10BBlack) So with these changes and cleanups in the past few weeks, we're basically down to two outstanding issues here from the original context: * T133548 - Create... [17:16:04] 10Operations, 10Traffic, 10HTTPS, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3426110 (10BBlack) [17:17:14] 10Operations, 10Traffic, 10HTTPS, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1423900 (10BBlack) [17:17:19] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3426117 (10RobH) a:05Papaul>03RobH I'm going to claim this for install, so @papaul can work on other onsite tasks =] [17:18:14] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3426120 (10faidon) [17:18:18] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3426123 (10faidon) [17:19:00] (03PS1) 10Jcrespo: dbstore_multiiinstance: Set default basedir depending on the os [puppet] - 10https://gerrit.wikimedia.org/r/364476 (https://phabricator.wikimedia.org/T169514) [17:19:08] 10Operations, 10Traffic, 10HTTPS, 10Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#3426128 (10BBlack) 
[17:19:11] (03PS2) 10Jcrespo: dbstore_multiiinstance: Set default basedir depending on the os [puppet] - 10https://gerrit.wikimedia.org/r/364476 (https://phabricator.wikimedia.org/T169514) [17:19:42] !log starting branch cut for 1.30.0-wmf.9 T167893 [17:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:54] T167893: MW-1.30.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T167893 [17:20:04] (03CR) 10jerkins-bot: [V: 04-1] dbstore_multiiinstance: Set default basedir depending on the os [puppet] - 10https://gerrit.wikimedia.org/r/364476 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [17:20:45] (03PS3) 10Jcrespo: dbstore_multiiinstance: Set default basedir depending on the os [puppet] - 10https://gerrit.wikimedia.org/r/364476 (https://phabricator.wikimedia.org/T169514) [17:20:45] !log mw2201, mw2202 - depool appservers for T169360 (drain flea power) [17:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:56] T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360 [17:22:52] (03CR) 10Jcrespo: [C: 032] dbstore_multiiinstance: Set default basedir depending on the os [puppet] - 10https://gerrit.wikimedia.org/r/364476 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [17:26:52] (03PS1) 10Jcrespo: mariadb: Fix typo on service unit s/LimitCore/LimitCORE/ [software] - 10https://gerrit.wikimedia.org/r/364477 [17:29:36] 10Operations, 10Traffic, 10Wikimedia-Shop, 10HTTPS: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3426177 (10BBlack) [17:30:15] (03PS1) 10RobH: settting netmon2001 install params [puppet] - 10https://gerrit.wikimedia.org/r/364478 [17:30:37] PROBLEM - Host mw2201 is DOWN: PING CRITICAL - Packet loss = 100% [17:30:37] PROBLEM - Host mw2202 is DOWN: PING CRITICAL - Packet loss = 100% [17:30:56] (03CR) 10RobH: [C: 032] settting netmon2001 install params 
[puppet] - 10https://gerrit.wikimedia.org/r/364478 (owner: 10RobH) [17:31:09] (03PS2) 10RobH: settting netmon2001 install params [puppet] - 10https://gerrit.wikimedia.org/r/364478 [17:31:37] PROBLEM - Host mw2201.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [17:31:37] PROBLEM - Host mw2202.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [17:34:17] ACKNOWLEDGEMENT - Host mw2201 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn fixing IPMI [17:34:17] ACKNOWLEDGEMENT - Host mw2201.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn fixing IPMI [17:34:51] ACKNOWLEDGEMENT - Host mw2202 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn papaul drains flea power [17:34:51] ACKNOWLEDGEMENT - Host mw2202.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn papaul drains flea power [17:35:43] apergos: are nfs problems in https://phabricator.wikimedia.org/T169680 fixed? I am wondering whether wikidata dumps are back to normal [17:38:46] PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:39:35] (03PS1) 10RobH: Revert "settting netmon2001 install params" [puppet] - 10https://gerrit.wikimedia.org/r/364479 [17:40:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816#3377656 (10debt) @Cmjohnson - @gehel is on vacation until July 21st, so hopefully you two can re-connect at that time. 
:) [17:40:26] 10Operations: sshd stretch puppet support - https://phabricator.wikimedia.org/T170298#3426212 (10jcrespo) [17:40:39] 10Operations: sshd stretch puppet support - https://phabricator.wikimedia.org/T170298#3426226 (10jcrespo) p:05Triage>03Low [17:43:32] (03CR) 10RobH: [C: 032] Revert "settting netmon2001 install params" [puppet] - 10https://gerrit.wikimedia.org/r/364479 (owner: 10RobH) [17:45:56] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational [17:52:42] (03PS1) 10RobH: correcting netmon2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/364481 [17:53:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): some elasticsearch servers in eqiad have CPU overheating - https://phabricator.wikimedia.org/T168816#3426274 (10EBernhardson) I think we should be able to take care of this before @gehel comes back, the main sticking point will be havi... [17:54:12] (03CR) 10RobH: [C: 032] correcting netmon2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/364481 (owner: 10RobH) [17:54:26] PROBLEM - cassandra-a SSL 10.192.16.154:7001 on restbase-test2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:No route to host [17:56:55] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3426282 (10Papaul) on both mw2201 and mw2202 I am getting {F8706259} {F8706260} I can not reset the IDRAC in the BIOS also. This looks like HW p... 
[17:58:24] (03Abandoned) 10Paladox: phabricator/varnish: Block /file/upload instead of /file/data for WP0 users [puppet] - 10https://gerrit.wikimedia.org/r/364424 (https://phabricator.wikimedia.org/T170200) (owner: 10Paladox) [17:58:41] (03PS24) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [17:58:54] (03PS1) 10RobH: setting netmon2001 install params [puppet] - 10https://gerrit.wikimedia.org/r/364482 [17:59:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:59:36] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:59:46] RECOVERY - Host mw2201 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [18:00:46] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:05:49] (03CR) 10Dzahn: [C: 031] setting netmon2001 install params [puppet] - 10https://gerrit.wikimedia.org/r/364482 (owner: 10RobH) [18:06:26] SMalyshev: they have been running normally; I sent an update to the list. However I want to leave the ticket open a while yet. [18:06:36] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:06:54] at least I think I sent one, either that or I hallucinated it! 
[18:07:36] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:07:46] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:09:15] !log mw2201 - repooled [18:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:52] (03CR) 10RobH: [C: 032] setting netmon2001 install params [puppet] - 10https://gerrit.wikimedia.org/r/364482 (owner: 10RobH) [18:12:52] (03PS1) 10Jcrespo: dbstore_multiinstance: uncomment includedir [puppet] - 10https://gerrit.wikimedia.org/r/364483 (https://phabricator.wikimedia.org/T169514) [18:13:22] (03PS2) 10Jcrespo: dbstore_multiinstance: uncomment includedir [puppet] - 10https://gerrit.wikimedia.org/r/364483 (https://phabricator.wikimedia.org/T169514) [18:13:42] (03CR) 10Jcrespo: [V: 032 C: 032] dbstore_multiinstance: uncomment includedir [puppet] - 10https://gerrit.wikimedia.org/r/364483 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [18:13:54] (03PS2) 10Ottomata: Use pulls rather than updates to pull cloudera jessie packages into stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364467 (https://phabricator.wikimedia.org/T152712) [18:14:01] (03CR) 10Ottomata: [V: 032 C: 032] Use pulls rather than updates to pull cloudera jessie packages into stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/364467 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:15:56] RECOVERY - Host mw2202 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [18:16:16] PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:17:00] !log ms2202 - repooled [18:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:17] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:17:24] I broke dhcp again [18:17:28] on a change that shouldnt break it [18:17:38] its being investigated by both myself and mutante ;] [18:17:46] /etc/dhcp/dhcpd.conf line 444: /etc/dhcp/linux-host-entries.ttyS1-115200: bad parse. [18:17:49] it's in this file [18:18:19] 444 include "/etc/dhcp/linux-host-entries.ttyS1-115200"; [18:18:21] that's the line [18:18:26] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[isc-dhcp-server] [18:18:26] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 848.94 seconds [18:18:34] but where is the error, heh [18:18:36] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 856.68 seconds [18:19:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 883.28 seconds [18:19:25] (03PS3) 10Ottomata: Prep for stat100[56] [puppet] - 10https://gerrit.wikimedia.org/r/364427 (https://phabricator.wikimedia.org/T152712) [18:19:29] (03CR) 10Ottomata: [V: 032 C: 032] Prep for stat100[56] [puppet] - 10https://gerrit.wikimedia.org/r/364427 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:19:36] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 905.33 seconds [18:19:46] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 896.15 seconds [18:20:17] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 868.14 seconds [18:20:46] robh: i think i finally see it [18:20:55] ? [18:20:57] (03PS1) 10Thcipriani: Group0 to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364485 [18:20:57] the only one with "S1-115200" [18:21:02] vs. S0-115200 ? 
[18:21:12] the issue isn't in dhcpd.conf [18:21:27] it's in linux-host-entries.ttyS1-115200 [18:21:27] it says line 444 in that file is the issue [18:21:27] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [18:21:35] that line is the include for linux-host-entries.ttyS1-115200 [18:21:42] if i revert my change to linux-host-entries.ttyS1-115200 and remove netmon2001, it works [18:21:46] if i add it back in, it breaks [18:21:58] so there is some kind of error being introduced by my change to linux-host-entries.ttyS1-115200 [18:22:17] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:22:19] though I'm not entirely sure how since I'm just copying the known good stanza above it and substituting out the hostname, fqdn, and mac address entries [18:22:27] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:22:28] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:22:31] does that make sense?
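(A stanza copied from a known-good entry can still carry characters that look fine in an editor but that a config parser rejects, such as a non-breaking space or a stray carriage return. The sketch below is one possible way to look for them, not what was run in the channel. The function name `find_suspect_chars` and the sample host stanza, including its hostname and MAC address, are hypothetical and only illustrate the idea.)

```python
import unicodedata


def find_suspect_chars(text):
    """Return (line, column, codepoint, name) for every character that is
    neither printable ASCII nor a tab. Lines are split on "\n" only, so a
    stray carriage return is reported instead of being silently consumed."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t" or 0x20 <= ord(ch) <= 0x7E:
                continue
            # Control characters have no Unicode name, hence the fallback.
            name = unicodedata.name(ch, "UNNAMED")
            hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical stanza with a non-breaking space hidden after "fixed-address".
sample = (
    "host example1001 {\n"
    "    hardware ethernet 00:11:22:33:44:55;\n"
    "    fixed-address\u00a0example1001.example.org;\n"
    "}\n"
)

for hit in find_suspect_chars(sample):
    print(hit)
```

(To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to the function. An empty result only rules out this one class of problem; a parse failure can still come from ordinary syntax such as a missing semicolon or brace.)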
[18:22:37] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:22:37] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:22:37] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:22:37] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:22:37] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:22:38] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:22:38] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:22:44] (03PS2) 10Jcrespo: mariadb: Fix systemd unit for controling multi-instances [software] - 10https://gerrit.wikimedia.org/r/364477 (https://phabricator.wikimedia.org/T169514) [18:22:47] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:22:48] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:22:57] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:23:08] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:23:08] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:23:08] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:25:07] robh: yes, it does and i still dont see why it breaks [18:25:16] my only guess is some special char from copy/paste now [18:25:30] !log mw2154 - depool for attempting IPMI fix [18:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:48] PROBLEM - CirrusSearch eqiad 95th percentile latency on 
graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [18:26:47] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:27:32] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=mw2154.codfw.wmnet [18:27:37] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [18:27:39] (03PS1) 10Ottomata: Add stat1005 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/364487 (https://phabricator.wikimedia.org/T152712) [18:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:49] elukey: I think dbstore1002 crashed [18:27:53] ottomata: ^ [18:27:57] ya just saw [18:28:08] jynus: that's not related to the commit you just made? [18:28:10] mariadb: Fix systemd unit ? [18:28:10] !log T169498: elastic@eqiad huge but short load spike on 24+ nodes (despite the workaround on token_count_router deployed) [18:28:15] ottomata: no [18:28:17] ok looking [18:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:22] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498 [18:28:27] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational [18:28:28] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[isc-dhcp-server] [18:29:19] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] [18:29:27] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [18:29:37] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [18:29:40] (03CR) 10Legoktm: [C: 031] Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) (owner: 10Dereckson) [18:30:09] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [18:30:17] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:30:38] hm jynus dbstore1002 does not have an error log? [18:30:43] (03PS1) 10RobH: fixing ordering of servers [puppet] - 10https://gerrit.wikimedia.org/r/364489 [18:30:49] ottomata: 2017-07-11 18:20:09 7f01a17fd700 InnoDB: Assertion failure in thread 139644981204736 in file srv0srv.cc line 2200 [18:30:57] InnoDB: We intentionally generate a memory trap. [18:31:10] jynus: how'd you get that? [18:31:14] (03CR) 10RobH: [C: 032] fixing ordering of servers [puppet] - 10https://gerrit.wikimedia.org/r/364489 (owner: 10RobH) [18:31:23] log is not on syslog before systemd [18:31:53] it is on /srv/sqldata/dbstore1002.err [18:31:58] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [18:32:02] ahhhh hostname.err [18:32:07] so many files in sqldata [18:32:18] ottomata: we call it "database" [18:32:23] :-) [18:32:29] haha, database with error log files [18:32:45] hmm, its starting back up? 
[18:32:50] yes [18:32:54] it is recovering [18:32:55] ok [18:33:04] but if innodb killed itself [18:33:12] it normally has a good reason [18:33:47] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:34:25] it was also doing a long alter, I think [18:34:27] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0] [18:34:48] PROBLEM - Host mw2154 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:08] from replication or some maintainence? [18:35:27] maintenance, I don't have the details [18:35:32] only looking at the graph [18:35:41] https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=dbstore1002 [18:35:57] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [18:36:02] ACKNOWLEDGEMENT - Host mw2154 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn IPMI [18:36:30] marostegui: yt? know anything about this alter? [18:37:16] innodb purge lag? 
[18:37:31] don't worry too much about it [18:37:37] it just signals that a transaction [18:37:43] was open for a long time [18:37:53] normally it is due to maintenance [18:38:00] aye, ok, well, dunno what that alter was, but i guess the node will come back up without it [18:38:08] yep [18:38:12] it has to revert it [18:38:17] PROBLEM - Host mw2154.mgmt.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:38:22] which is why it is taking so long to reboot [18:38:24] aye [18:38:40] this is why we do not want multi-source hosts anymore [18:38:55] if we had several instances, one crashes, the rest are still up [18:39:28] we have too much data to have a monolithic highly-compressed 5 TB db [18:39:28] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3426537 (10Nuria) 05Open>03Resolved [18:39:34] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3017036 (10Nuria) [18:39:35] (03PS2) 10Ottomata: Add stat1005 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/364487 (https://phabricator.wikimedia.org/T152712) [18:39:43] aye [18:40:10] (03CR) 10Ottomata: [V: 032 C: 032] Add stat1005 in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/364487 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:40:10] can you signal at least on the irc channel about this outage [18:40:25] it shouldn't take too long, but before folks ask [18:41:07] RECOVERY - Host mw2154 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [18:41:21] bless y'all right now. I'm sitting at the edge of my seat with fingers crossed waiting to be able to connect.
good luck and hoping this gets resolved soon :) [18:41:26] done [18:43:13] (03PS1) 10Ottomata: Don't try to set up rsync server for hdfs archive on stat1005 yet [puppet] - 10https://gerrit.wikimedia.org/r/364494 (https://phabricator.wikimedia.org/T152712) [18:43:20] !log thcipriani@tin Pruned MediaWiki: 1.30.0-wmf.6 [keeping static files] (duration: 06m 28s) [18:43:28] RECOVERY - Host mw2154.mgmt.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 36.81 ms [18:43:29] (03CR) 10Ottomata: [V: 032 C: 032] Don't try to set up rsync server for hdfs archive on stat1005 yet [puppet] - 10https://gerrit.wikimedia.org/r/364494 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:18] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [18:44:18] jynus: it just came back up, but it says a couple of the mysql db tables need to be repaired [18:44:21] should I do that manually? [18:44:27] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:44:37] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [18:45:05] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2154.codfw.wmnet [18:45:29] ottomata: let me see [18:45:30] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3426557 (10Dzahn) [18:46:10] ottomata: it should do that automatically [18:47:14] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3395803 (10Dzahn) worked with papaul to drain flea power for the remaining codfw ones: mw2154 has been fixed after draining flea power and now wor... 
[18:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:22] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [18:49:22] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:49:22] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:49:22] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:49:22] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:49:22] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:49:22] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:49:23] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:49:23] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [18:49:23] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes 
[18:49:23] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:49:23] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:49:23] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR [18:49:23] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 212.51 seconds [18:49:23] PROBLEM - DPKG on stat1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:49:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:49:23] RECOVERY - DPKG on stat1005 is OK: All packages OK [18:50:07] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:50:38] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [18:50:52] ottomata: 8% free disk [18:50:57] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:51:17] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:51:22] I think at 5%, tokudb stops writing [18:52:47] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1031-bin.002174, end_log_pos 699146723 [18:53:38] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 132.03 seconds [18:56:17] RECOVERY - Codfw 
HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:57:33] 10Operations, 10ops-codfw, 10monitoring: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3426580 (10Dzahn) [18:57:39] (03PS1) 10Ottomata: Use openjdk-8 for analytics and statistics servers if it is available [puppet] - 10https://gerrit.wikimedia.org/r/364496 [18:57:50] jynus: aye [18:58:03] elukey: is hoping the cleaner stuff along with maybe some optimize tables will help [18:58:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:58:10] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:58:34] (03CR) 10jerkins-bot: [V: 04-1] Use openjdk-8 for analytics and statistics servers if it is available [puppet] - 10https://gerrit.wikimedia.org/r/364496 (owner: 10Ottomata) [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170711T1900). Please do the needful. 
[19:00:24] (03PS2) 10Ottomata: Use openjdk-8 for analytics and statistics servers if it is available [puppet] - 10https://gerrit.wikimedia.org/r/364496 [19:01:08] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 190.32 seconds [19:01:27] * thcipriani does needful [19:02:25] (03CR) 10Brian Wolff: [C: 031] Add ar_content_format and ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: 10Umherirrender) [19:03:07] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 212.96 seconds [19:04:02] !log thcipriani@tin Started scap: testwiki to php-1.30.0-wmf.9 and rebuild l10n cache [19:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:13] (03CR) 10Ottomata: [C: 032] "No op: https://puppet-compiler.wmflabs.org/7013/" [puppet] - 10https://gerrit.wikimedia.org/r/364496 (owner: 10Ottomata) [19:04:37] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 278.70 seconds [19:05:27] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 892.52 seconds [19:06:24] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3426638 (10Cmjohnson) @jcrespo db1100, 1105 were the same issue db1104 is something else. 
I will update once I figure it out [19:09:57] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 144.79 seconds [19:10:37] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 285.95 seconds [19:14:48] (03PS1) 10Ottomata: Install jupyter-notebook for stretch [puppet] - 10https://gerrit.wikimedia.org/r/364501 (https://phabricator.wikimedia.org/T152712) [19:15:36] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 271.20 seconds [19:16:23] 10Operations, 10ops-codfw, 10monitoring: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3426678 (10Dzahn) [19:17:03] (03CR) 10Ottomata: [C: 032] Install jupyter-notebook for stretch [puppet] - 10https://gerrit.wikimedia.org/r/364501 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [19:17:59] !log shutting down sodium for iDRAC reset (T169360) [19:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:08] T169360: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360 [19:18:10] cmjohnson1: shutting down [19:18:26] cmjohnson1: aand down [19:18:41] great...give me about 1 min and I will power up [19:18:51] thx! [19:20:07] PROBLEM - Host sodium is DOWN: PING CRITICAL - Packet loss = 100% [19:21:24] (03PS1) 10Ottomata: Add hosts/stat1005.yaml to add user groups to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364503 [19:21:56] RECOVERY - Host sodium.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.95 ms [19:21:56] (03PS1) 10Dzahn: Revert "Revert "disable base monitoring for labtest* machines"" [puppet] - 10https://gerrit.wikimedia.org/r/364504 [19:22:10] hi, it seems mirrors has stopped reponding when i do apt-get update. 
[19:22:12] i get [19:22:13] 0% [Connecting to mirrors.wikimedia.org (208.80.154.15)] [19:22:17] and gets stuck there [19:22:18] known [19:22:22] ok [19:22:23] and planned [19:23:08] paravoid:coming back up [19:23:10] (03CR) 10Dzahn: "created this as a reminder to try it again but incl. manually cleaning up Icinga to avoid the issue mentioned on the revert" [puppet] - 10https://gerrit.wikimedia.org/r/364504 (owner: 10Dzahn) [19:23:16] RECOVERY - Host sodium is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:23:55] cmjohnson1: awesome! [19:23:59] cmjohnson1: it's fixed :) [19:24:09] (03CR) 10Ottomata: [C: 032] Add hosts/stat1005.yaml to add user groups to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364503 (owner: 10Ottomata) [19:25:12] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3426742 (10faidon) [19:26:06] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:28:06] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:29:13] !log thcipriani@tin Finished scap: testwiki to php-1.30.0-wmf.9 and rebuild l10n cache (duration: 25m 11s) [19:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:54] (03PS1) 10Jcrespo: dbstore-multiinstance: Open firewall of mysql service [puppet] - 10https://gerrit.wikimedia.org/r/364506 (https://phabricator.wikimedia.org/T169514) [19:32:25] !log powering off mw1199 to reset idrac [19:32:26] (03PS2) 10Jcrespo: dbstore-multiinstance: Open firewall for multiple mysql services [puppet] - 10https://gerrit.wikimedia.org/r/364506 (https://phabricator.wikimedia.org/T169514) [19:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:16] PROBLEM - IPMI Temperature on mw1199 is CRITICAL: Return code of 255 is out of bounds [19:34:03] (03PS3) 10Jcrespo: dbstore-multiinstance: Open firewall for multiple mysql services [puppet] - 10https://gerrit.wikimedia.org/r/364506 (https://phabricator.wikimedia.org/T169514) [19:34:18] (03PS1) 10Ottomata: Adding not about aspell-id [puppet] - 10https://gerrit.wikimedia.org/r/364507 [19:34:30] (03CR) 10Ottomata: [V: 032 C: 032] Adding not about aspell-id [puppet] - 10https://gerrit.wikimedia.org/r/364507 (owner: 10Ottomata) [19:34:46] ACKNOWLEDGEMENT - kartotherian endpoints health on maps-test2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [19:34:46] }/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the 
unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (e [19:34:46] ns Freshly reimaged few days ago, test hosts, gehel on holiday. ACKing and linking to https://phabricator.wikimedia.org/T169011 [19:34:46] ACKNOWLEDGEMENT - kartotherian endpoints health on maps-test2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [19:34:47] }/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (e [19:34:47] ns Freshly reimaged few days ago, test hosts, gehel on holiday. 
ACKing and linking to https://phabricator.wikimedia.org/T169011 [19:34:49] ACKNOWLEDGEMENT - kartotherian endpoints health on maps-test2003 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [19:34:50] }/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (e [19:34:50] ns Freshly reimaged few days ago, test hosts, gehel on holiday. 
ACKing and linking to https://phabricator.wikimedia.org/T169011 [19:34:52] ACKNOWLEDGEMENT - kartotherian endpoints health on maps-test2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [19:34:53] }/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (e [19:34:53] ns Freshly reimaged few days ago, test hosts, gehel on holiday. ACKing and linking to https://phabricator.wikimedia.org/T169011 [19:35:26] 10Operations, 10Goal, 10Kubernetes, 10Services (watching): Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119#3426800 (10mobrovac) [19:35:28] PROBLEM - Check whether ferm is active by checking the default input chain on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:35:28] PROBLEM - configured eth on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:35:46] PROBLEM - DPKG on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:35:47] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:56] PROBLEM - SSH on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:56] PROBLEM - Disk space on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:35:56] PROBLEM - HHVM processes on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:06] PROBLEM - Check systemd state on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:06] PROBLEM - dhclient process on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:07] PROBLEM - nutcracker process on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:07] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:14] what's wrong? [19:36:16] PROBLEM - puppet last run on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:16] PROBLEM - salt-minion processes on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:18] PROBLEM - Check size of conntrack table on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:18] PROBLEM - nutcracker port on mw1199 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:18] PROBLEM - Nginx local proxy to apache on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:23] chris is rebooting it [19:36:26] volans i didn't silence mw1199 sorry [19:37:05] ah ok, np :) [19:37:22] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3426811 (10RobH) [19:37:24] ACKNOWLEDGEMENT - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2771.34 seconds Jcrespo pending fix: T170308 [19:37:24] ACKNOWLEDGEMENT - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1031-bin.002174, end_log_pos 699146723 Jcrespo pending fix: T170308 [19:37:37] brion: History question - do you 
remember the story behind migrateuser_medium table? I found references to the migrateuser table from 2007, but not sure where the medium comes in [19:38:05] (03PS1) 10Ottomata: Install hunspell-en-us instead of myspell-en-us in Stretch [puppet] - 10https://gerrit.wikimedia.org/r/364508 (https://phabricator.wikimedia.org/T152712) [19:38:16] RECOVERY - Check whether ferm is active by checking the default input chain on mw1199 is OK: OK ferm input default policy is set [19:38:17] RECOVERY - configured eth on mw1199 is OK: OK - interfaces up [19:38:37] RECOVERY - DPKG on mw1199 is OK: All packages OK [19:38:46] RECOVERY - SSH on mw1199 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [19:38:46] RECOVERY - Disk space on mw1199 is OK: DISK OK [19:38:47] RECOVERY - HHVM processes on mw1199 is OK: PROCS OK: 6 processes with command name hhvm [19:38:56] RECOVERY - dhclient process on mw1199 is OK: PROCS OK: 0 processes with command name dhclient [19:38:56] RECOVERY - Check systemd state on mw1199 is OK: OK - running: The system is fully operational [19:38:58] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3426824 (10Cmjohnson) @jcrespo db1104 is fixed, vlan conflict. [19:39:06] RECOVERY - nutcracker process on mw1199 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [19:39:06] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [19:39:07] RECOVERY - salt-minion processes on mw1199 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:39:07] RECOVERY - Check size of conntrack table on mw1199 is OK: OK: nf_conntrack is 0 % full [19:39:15] 10Operations, 10monitoring: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3287592 (10RobH) a:05RobH>03Dzahn Assigned to @dzahn for service implemetnation. 
I've assumed it goes to him, since he is handling the stretch service updates for netmon1002. [19:39:16] RECOVERY - nutcracker port on mw1199 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [19:39:16] RECOVERY - Nginx local proxy to apache on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.215 second response time [19:39:45] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3426835 (10Cmjohnson) [19:39:47] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 74474 bytes in 4.027 second response time [19:39:49] (03CR) 10Ottomata: [C: 032] Install hunspell-en-us instead of myspell-en-us in Stretch [puppet] - 10https://gerrit.wikimedia.org/r/364508 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [19:40:06] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.674 second response time [19:40:07] bawolff: hmm, no don't recall that one [19:40:38] Was that when we did a password salt setup? Or sth else... 
[19:41:20] Tim might know but he's on vacation I believe [19:41:36] brion: I think it has something to do with unifying accounts [19:41:50] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:42:03] 10Operations, 10DBA, 10Mail: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3426863 (10herron) [19:42:04] I'm mostly just curious, all that matters at this point is its not used anymore [19:42:13] Heh [19:42:48] Database alters should come with a git history and commit messages [19:43:12] Would make it easier to understand wtf happened :) [19:45:03] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3426893 (10faidon) [19:45:25] brion: I am thinking of maintaining my own tables.sql on a wmf repo [19:45:45] eventually to integrate it with a migration handler [19:46:49] but I say I am going to do that since I entered: https://phabricator.wikimedia.org/T104459 [19:47:02] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Implement a pod networking policy approach - https://phabricator.wikimedia.org/T170111#3419728 (10GWicke) The whitelisting & defaults described in the description generally make sense to me. This will be a big step forward from the st... 
[19:47:20] :) [19:48:02] (03CR) 10Thcipriani: [C: 032] Group0 to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364485 (owner: 10Thcipriani) [19:49:01] (03Merged) 10jenkins-bot: Group0 to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364485 (owner: 10Thcipriani) [19:49:10] (03CR) 10jenkins-bot: Group0 to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364485 (owner: 10Thcipriani) [19:50:20] RECOVERY - Host conf1003.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [19:50:20] RECOVERY - Host kafka1018.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [19:51:20] RECOVERY - Host kafka1020.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [19:51:45] (03PS4) 10Jcrespo: dbstore-multiinstance: Open firewall for multiple mysql services [puppet] - 10https://gerrit.wikimedia.org/r/364506 (https://phabricator.wikimedia.org/T169514) [19:52:44] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3426956 (10faidon) [19:53:03] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.30.0-wmf.9 [19:53:05] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#2775695 (10faidon) Chris fixed the cables for conf1003, kafka1018, kafka1020 and db1063. All fixed! 
[19:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:47] (03PS10) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [19:54:35] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: revert group0 to 1.30.0-wmf.9 [19:54:42] (03CR) 10Jcrespo: [C: 032] dbstore-multiinstance: Open firewall for multiple mysql services [puppet] - 10https://gerrit.wikimedia.org/r/364506 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [19:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:53] paravoid: \o/ [19:55:00] yup [19:55:06] and mw1196 is dead, needs decom [19:55:21] so that leaves just mw2201/2202 which is T170307 [19:55:22] T170307: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307 [19:55:28] thcipriani: Wait, wmf.**9**? Today's branch was going to be wmf.8… [19:55:41] RECOVERY - Host db1063.mgmt.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [19:55:58] James_F: we've moved to a model where the branch number is tied to the week, so if we skip a week (July 4th) we skip that branch number [19:56:12] thcipriani: That's… really profoundly unhelpful. :-( [19:56:31] thcipriani: Was there an e-mail I missed? I'm going to need to go fix a few dozen things. [19:56:39] 10Operations, 10RESTBase-Cassandra, 10Services (later): cassandra slow streaming during (de)commission - https://phabricator.wikimedia.org/T126619#3426984 (10GWicke) [19:56:51] would be more useful with calendar ISO week numbers [19:57:02] Sure, but we're not using that either. 
[19:57:02] 10Operations, 10RESTBase-Cassandra, 10Services (later): cassandra slow streaming during (de)commission - https://phabricator.wikimedia.org/T126619#2018891 (10GWicke) a:03Eevans [19:57:07] James_F: sorry about that :( we posted our intentions on the task https://phabricator.wikimedia.org/T167893#3350385 [19:58:01] thcipriani: Ah. OK, well then, everything'll just have to be broken for a week or two until I have time to fix things. [19:58:11] :( [19:58:54] James_F: sorry 'bout that. Hopefully this new method (short of going to just ISO weeks) is easier as it's basically ISO weeks just offset and reset one MW releases [19:59:11] s/reset one/reset on/ [19:59:12] (03PS11) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [19:59:26] greg-g: Well, ish. The main reasons for humans referring to train releases (i.e., cache cycles) are going to be lots harder. [19:59:41] greg-g: But yeah, I appreciate the desire. [19:59:51] ReleaseTaggerBot is going to be completely screwed. [19:59:59] cache cycles are number of days, not number of releases, so nothing changes (or is actually simplified) with this [20:00:04] No? [20:00:30] we keep old static files until they're older than 30ish days [20:00:35] (03CR) 10Andrew Bogott: [C: 032] Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 (owner: 10Andrew Bogott) [20:00:45] Yeah, but HTML caching is now down to release+2 weeks. [20:00:54] (03PS1) 10Thcipriani: Revert "Group0 to 1.30.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364513 [20:01:05] Which is the main one, thanks to Timo's awesome work on ResourceLoader static loading. [20:01:08] so wmf.9 corresponds to which week? [20:01:13] 9th week since 1.30 was released? [20:01:15] paravoid: This one, apparently. 
[20:01:31] paravoid: 9th week since the 1.30-alpha branch cut [20:01:46] (1.29 is being released this week ;) ) [20:01:48] (03CR) 10Thcipriani: [C: 032] Revert "Group0 to 1.30.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364513 (owner: 10Thcipriani) [20:01:49] Maybe going to whole hog and calling them wmf.2017W20 ISO-style would be nice. [20:02:26] sure, this seemed like a good middle step as we weren't sure at first past of all the other places that depend on a wmf.XX notation [20:02:29] greg-g: But in general, notices to wikitech-l when RelEng change deployment and train processes would be nice. [20:02:33] first pass [20:02:36] Sure. [20:02:42] (03Merged) 10jenkins-bot: Revert "Group0 to 1.30.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364513 (owner: 10Thcipriani) [20:02:47] But half of those will break now, so will need fixing twice. ;-) [20:02:54] (03CR) 10jenkins-bot: Revert "Group0 to 1.30.0-wmf.9" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364513 (owner: 10Thcipriani) [20:02:56] 10Operations, 10RESTBase, 10Services: restbase unable to start after machine reimage - https://phabricator.wikimedia.org/T120379#3427027 (10GWicke) 05Open>03Invalid This is no longer relevant. [20:02:57] the week notation (even the ISO one) has the problem that you're fixing a weekly cadence into the version number :) [20:03:17] paravoid: Sure, but sadly that ship has sailed. :-( [20:03:20] RECOVERY - IPMI Temperature on mw1199 is OK: Sensor Type(s) Temperature Status: OK [20:03:21] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:03:33] and what happens if we go with a shorter cadence? :) [20:03:45] should all just be sha1s :) [20:03:52] yeah that was my question :) [20:04:08] just purely incremental svn/hg-style isn't bad either [20:04:09] Who needs branches anyway, deploy from master! Dockerise the thing. 
[20:04:18] Krinkle: :P :P [20:04:19] Krinkle: Hush you. [20:04:42] * Krinkle misses svn numbers [20:06:14] James_F: re announce (had a reply typed, then moved on in the convo), yeah, I'll make an expo facto one now with a mea culpa (any other latin I should include?) [20:06:58] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Implement a pod networking policy approach - https://phabricator.wikimedia.org/T170111#3427109 (10mobrovac) Same. Having a default white-list that covers most-used cases makes sense too. However, explicitly white-listing incoming con... [20:07:33] (03PS1) 10Reedy: Can't use NS_MODULE constant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364516 (https://phabricator.wikimedia.org/T170317) [20:08:01] greg-g: :-) Thanks. [20:08:16] I've fiddled with the Phab milestones so ReleaseTaggerBot will now be doing the right thing. [20:08:28] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Cloud-VPS: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3427136 (10Andrew) btw, these will be Jessie boxes, despite being in the Labs cluster. Thanks. [20:09:09] 10Operations, 10ChangeProp, 10Services (done): Add storage to Change-Prop for deduplication - https://phabricator.wikimedia.org/T157089#3427137 (10Pchelolo) 05Open>03Resolved Redis was added to #changeprop nodes and is already successfully used for blacklisting unparseable pages. This is done. 
[20:09:10] (03PS1) 10Andrew Bogott: Make labtestpuppetmaster2001 a Jessie box [puppet] - 10https://gerrit.wikimedia.org/r/364519 [20:10:20] (03CR) 10Andrew Bogott: [C: 032] Make labtestpuppetmaster2001 a Jessie box [puppet] - 10https://gerrit.wikimedia.org/r/364519 (owner: 10Andrew Bogott) [20:10:35] James_F: gracias [20:11:18] (03CR) 10Krinkle: "See also ab9ae6efe041a2b5c847c4a6e13a3671a1d33431" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364516 (https://phabricator.wikimedia.org/T170317) (owner: 10Reedy) [20:11:54] Reedy: I'm not sure how that NS_MODULE notice gets triggered by wmf.9. The extension was already wfLoadExtension-loaded. [20:11:57] What changed? [20:12:08] Scribunto being swapped to extension registration? [20:12:21] Oh [20:12:23] https://github.com/wikimedia/mediawiki-extensions-Scribunto/commit/246df8d4275c2eb1e6fc1e8116c5e6ea7f0571e3 [20:12:23] I don't think that made it into .9 [20:12:29] We load it with require_once .php [20:12:30] in wmf-config [20:12:35] but internally it became extension registered. [20:12:58] yup. 
Id00a2a00bddf72f5c8716f21226695456b3a32c6 is in wmf.9 [20:13:21] (03PS1) 10Thcipriani: Remove use of NS_MODULE in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364522 (https://phabricator.wikimedia.org/T170317) [20:14:06] (03CR) 10Thcipriani: [C: 032] Can't use NS_MODULE constant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364516 (https://phabricator.wikimedia.org/T170317) (owner: 10Reedy) [20:14:26] (03Abandoned) 10Thcipriani: Remove use of NS_MODULE in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364522 (https://phabricator.wikimedia.org/T170317) (owner: 10Thcipriani) [20:14:52] heh, of course, never play patch quick draw with Reedy :) [20:14:59] (03Merged) 10jenkins-bot: Can't use NS_MODULE constant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364516 (https://phabricator.wikimedia.org/T170317) (owner: 10Reedy) [20:15:01] PROBLEM - Host labtestpuppetmaster2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:15:20] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [20:15:59] (03CR) 10jenkins-bot: Can't use NS_MODULE constant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364516 (https://phabricator.wikimedia.org/T170317) (owner: 10Reedy) [20:16:18] thcipriani: oops. I had https://gerrit.wikimedia.org/r/#/c/363531/ ready for that [20:17:37] legoktm: Written but not merged doesn't butter any parsnips. :-) [20:17:54] yeah, I dropped the ball on that [20:18:02] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Services, 10Patch-For-Review: column family cassandra metrics size - https://phabricator.wikimedia.org/T113733#3427209 (10GWicke) @eevans, @fgiunchedi: Are we good with the blacklist? Should we resolve this task? 
[20:18:07] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Patch-For-Review, 10Services (watching): column family cassandra metrics size - https://phabricator.wikimedia.org/T113733#3427210 (10GWicke) [20:18:30] RECOVERY - Host labtestpuppetmaster2001 is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [20:18:47] legoktm: heh, any opposition to changing it back to the constant later after wmf.9 is everywhere? [20:19:01] after wmf.9 is everywhere that code can just be removed entirely [20:19:32] Scribunto now sets $wgTemplateSandboxEditNamespaces in the extension itself [20:20:28] sure, I suppose my question is, is the patch I just merged fine with you for the time being? :) [20:20:31] PROBLEM - Check whether ferm is active by checking the default input chain on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:20:32] 10Operations, 10Epic, 10Goal, 10Services (next): End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3427245 (10Pchelolo) [20:20:40] PROBLEM - Check size of conntrack table on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:20:51] PROBLEM - dhclient process on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:21:00] PROBLEM - salt-minion processes on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:21:19] oh yeah [20:21:21] PROBLEM - labspuppetbackend uWSGI web app on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:21:21] PROBLEM - MD RAID on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:21:21] PROBLEM - Disk space on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:21:30] PROBLEM - configured eth on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:21:30] PROBLEM - DPKG on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:21:54] ok :) [20:22:03] "butter 
any parsnips"? Now I'm hungry for lunch. [20:23:06] greg-g: If you're hungry for parsnips I've got some bad news about living in California. :-( [20:23:24] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: [[gerrit:364516|Can't use NS_MODULE constant]] T170317 (duration: 00m 43s) [20:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:38] T170317: Notice: Use of undefined constant NS_MODULE - assumed 'NS_MODULE' in /srv/mediawiki/wmf-config/CommonSettings.php on line 3099 - https://phabricator.wikimedia.org/T170317 [20:23:52] alright, let's try this group0 thing once more [20:25:04] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.30.0-wmf.9 [20:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:33] 10Operations, 10Services, 10Documentation, 10Service-Architecture: Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780#3427318 (10GWicke) a:03mobrovac [20:27:50] 10Operations, 10Documentation, 10Service-Architecture, 10Services (later): Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780#1451502 (10GWicke) [20:27:58] (03PS1) 10Thcipriani: Revert "Revert "Group0 to 1.30.0-wmf.9"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364525 [20:28:11] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [20:29:34] (03CR) 10Thcipriani: [C: 032] Revert "Revert "Group0 to 1.30.0-wmf.9"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364525 (owner: 10Thcipriani) [20:29:42] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Services: Evaluate efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#3427342 (10GWicke) 05Open>03Resolved [20:30:32] (03Merged) 10jenkins-bot: Revert "Revert "Group0 to 1.30.0-wmf.9"" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/364525 (owner: 10Thcipriani) [20:30:41] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:30:46] (03CR) 10jenkins-bot: Revert "Revert "Group0 to 1.30.0-wmf.9"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364525 (owner: 10Thcipriani) [20:33:20] PROBLEM - Host labtestpuppetmaster2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:33:50] RECOVERY - Host labtestpuppetmaster2001 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [20:44:29] (03PS1) 10Andrew Bogott: Labtestpuppetmaster2001: Switch to Jessie, take two [puppet] - 10https://gerrit.wikimedia.org/r/364539 [20:46:07] (03CR) 10Andrew Bogott: [C: 032] Labtestpuppetmaster2001: Switch to Jessie, take two [puppet] - 10https://gerrit.wikimedia.org/r/364539 (owner: 10Andrew Bogott) [20:53:07] 10Operations, 10Cloud-Services, 10RESTBase, 10Traffic, and 3 others: Fix RESTBase support for wikitech.wikimedia.org - https://phabricator.wikimedia.org/T102178#3427614 (10GWicke) [20:53:47] 10Operations, 10Traffic, 10Services (watching), 10discovery-system, 10services-tooling: Figure out an etcd deploy strategy that includes multi DC failure scenarios. - https://phabricator.wikimedia.org/T98165#3427621 (10GWicke) [20:54:30] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [20:54:30] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [20:55:08] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3427636 (10Dereckson) Your [[ https://lists.wikimedia.org/pipermail/langcom/2017-June/001508.html | message ]] has successfully been sent to the list, an... 
[20:59:10] 10Operations, 10monitoring: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3427663 (10Dzahn) as @robh pointed out this used the wrong partman recipe and needs to be reinstalled to use both SSDs?! [21:02:30] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [21:02:31] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [21:03:21] 10Operations, 10Fundraising-Backlog, 10Technical-Debt: Determine if benefactorevents.wikimedia.org should be hosted on the production cluster or still on Microsoft Azure - https://phabricator.wikimedia.org/T166240#3427675 (10Jgreen) 05Open>03Resolved a:03Jgreen Closing this task because we are no longe... [21:03:53] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3427681 (10Dereckson) a:03Dereckson Wiki scheduled for creation 2017-07-12 10:00–13:00 UTC. [21:04:07] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3427683 (10Dereckson) a:03Dereckson Wiki scheduled for creation 2017-07-12 10:00–13:00 UTC. [21:05:42] 10Operations, 10Fundraising-Backlog, 10Technical-Debt: Determine if benefactorevents.wikimedia.org should be hosted on the production cluster or still on Microsoft Azure - https://phabricator.wikimedia.org/T166240#3427711 (10Dereckson) Thanks for the update. 
[21:05:54] (03CR) 10Chad: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:06:21] (03CR) 10Chad: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:07:43] (03PS1) 10Dzahn: add netmon2001 to site, equal to netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/364585 (https://phabricator.wikimedia.org/T166180) [21:13:24] PROBLEM - NTP on labtestpuppetmaster2001 is CRITICAL: NTP CRITICAL: No response from NTP server [21:15:44] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [21:15:44] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [21:16:34] mediawiki original uploads alerts: are those noise now or? [21:16:58] aren't usually AFAIK [21:17:12] *aren't noise [21:17:16] what are they? [21:17:40] new feature request: all alerts should have links to some dashboard showing what they're alerting on :) [21:17:51] yeah!! [21:18:01] I've already asked that for all grafana related alerts [21:18:03] PROBLEM - Host labtestpuppetmaster2001 is DOWN: PING CRITICAL - Packet loss = 100% [21:18:07] Helpful bot? https://commons.wikimedia.org/wiki/Special:ListFiles [21:18:08] a link to a graph [21:18:15] bblack: +10000 [21:18:27] MET it seems https://commons.wikimedia.org/wiki/File:The_Last_Judgment_MET_DP818258.jpg [21:18:33] is it alerting on a high rate of new uploads? 
[21:18:38] apparently [21:18:39] apparently [21:18:43] RECOVERY - Host labtestpuppetmaster2001 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [21:20:17] !log varnish backend restart on cp1072 (mailbox lag) [21:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:54] (unrelated, just happened to early-notice it nearing the warning->crit threshold) [21:22:43] PROBLEM - IPMI Temperature on labtestpuppetmaster2001 is CRITICAL: Return code of 255 is out of bounds [21:23:24] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish] [21:23:53] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.21 seconds [21:23:53] (03PS25) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [21:23:54] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.85 seconds [21:24:03] (03CR) 10Paladox: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [21:24:13] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.13 seconds [21:24:13] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.35 seconds [21:24:14] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.49 seconds [21:24:14] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.66 seconds [21:24:23] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [21:25:53] (03PS26) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 
(https://phabricator.wikimedia.org/T157414) [21:25:58] heh I think I found a run-no-puppet bug - it doesn't know about the splay times [21:26:29] the scenario is I did a "run-no-puppet varnish-backend-restart" on a host, and it ended up having the varnish-backend-restart and an agent run going concurrently, resulting in the puppetfail above [21:26:36] in the syslog you can see: [21:26:38] Jul 11 21:20:06 cp1072 puppet-agent[11717]: Sleeping for 41 seconds (splay is enabled) [21:26:41] Jul 11 21:20:10 cp1072 puppet-agent[11763]: Disabling Puppet. [21:26:49] lol, yes BUT the final behaviour is the intended one [21:26:53] Jul 11 21:20:47 cp1072 puppet-agent[11717]: Retrieving pluginfacts [21:27:15] I thought run-no-puppet was supposed to stall/wait if an agent was already going? [21:27:35] yes, if puppet if the lock is there [21:27:46] it looks like the cron'd agent started up and did a splay sleep, then run-no-puppet said "disabling" and starting taking my action, then the agent woke up from splay and started conflicting [21:27:47] s/if puppet// [21:28:23] I guess the agent doesn't lock during the splay-sleep, and doesn't check for disable after the splay-sleep? [21:28:29] which is kind of messed up [21:28:48] I actually had a race with run-no-puppet --failed-only that a puppet run started in the middle of the check if last run failed, it went right in the middle [21:29:12] bblack: probably, I can j.oe tomorrow about that if he already knows [21:29:41] the thing is that in the end that's what you wanted [21:30:03] there was no real puppet running, you run puppet, the other one failed because of the lock [21:30:23] oh sorry you run run-puppet-agent or disable-puppet? 
[21:30:38] * volans re-looking at the log lines [21:30:55] I ran "run-no-puppet varnish-backend-restart", which I thougt meant "ensure no agent is running when varnish-backend-restart runs" [21:31:19] sorry, I completely confused run-no-puppet with run-puppet-agent [21:31:21] my bad [21:31:28] run-no-puppet is e.ma's one right [21:32:08] yeah [21:32:32] it does seem to do what I would think would be the right thing (disable, then wait on any existing lock) [21:32:45] yeah, looking at the code now [21:32:48] but I can only guess as above that puppet fails to lock during splay sleep, and also doesn't check disable afterwards [21:33:06] so did puppet run as it was not disabled? [21:33:34] I think the sequence in syslog ended up being: [21:33:40] that's bad, checking syslog [21:33:43] 10Puppet, 10Release-Engineering-Team (Watching / External): Preload TestingAccessWrapper in production mwrepl - https://phabricator.wikimedia.org/T143607#3427878 (10greg) @Mattflaschen-WMF ^^ is that sufficient for your use case? [21:33:51] ACKNOWLEDGEMENT - Check systemd state on labstore2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
andrew bogott known issues [21:33:51] (03PS2) 10Chad: WIP: Simple wrapper around updating the interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363970 [21:34:31] 1) Cron starts puppet-agent, which is not disabled and starts a 41s splay sleep 2) I start run-no-puppet, which disables and checks for a lock and doesn't see one, executes command 3) While command is still executing, the agent wakes up from splay, ignores the (new) disabling-lockfile, and executes stuff [21:35:18] which is easily explainable by puppet not taking a lock before splay-sleep, and not checking the disable afterwards either [21:35:29] (at least that's my simple theory) [21:35:51] (03PS3) 10Chad: WIP: Simple wrapper around updating the interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363970 [21:36:29] instead it probably checks for disable before the splay sleep, and then takes the lock for the first time after the splay sleep [21:37:14] seems so [21:37:17] (03PS1) 10BBlack: VCL: fix keep values at 7d [puppet] - 10https://gerrit.wikimedia.org/r/364605 [21:37:19] (03PS1) 10BBlack: VCL: grace-within-TTL [puppet] - 10https://gerrit.wikimedia.org/r/364606 [21:40:13] (03CR) 10Chad: WIP: Simple wrapper around updating the interwiki cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363970 (owner: 10Chad) [21:44:05] (03CR) 10BBlack: [C: 04-1] "This is mostly a thought experiment so far, there's more thinking to do. 
I was mostly thinking from the backend-most cache's perspective " [puppet] - 10https://gerrit.wikimedia.org/r/364606 (owner: 10BBlack) [21:48:40] bblack: seems indeed that :/ [21:49:09] check if running, check if disabled, splay, acquire lock (if I'm reading the code correctly) [21:49:12] https://github.com/puppetlabs/puppet/blob/3.x/lib/puppet/agent.rb#L30 [21:50:15] and the code in master branch seems more or less the same [21:59:10] (03PS4) 1020after4: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [21:59:26] 10Operations, 10monitoring: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353#3428036 (10Volans) [21:59:40] bblack: feel free to add your thought ^^^ ;) [22:01:38] for the run-no-puppet, given that the problem is inside puppet agent and that splay is more though for the daemonized version of it, we could just drop the splay and do it ourselves in the /usr/local/sbin/puppet-run script before calling puppet [22:11:17] (03CR) 10Dzahn: [C: 04-2] add netmon2001 to site, equal to netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/364585 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn) [22:13:00] (03PS2) 10Dzahn: add netmon2001 to site, equal to netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/364585 (https://phabricator.wikimedia.org/T166180) [22:18:17] (03CR) 10Dzahn: [C: 032] add netmon2001 to site, equal to netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/364585 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn) [22:22:34] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [22:23:43] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
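[annotation] The ordering described above (check if running, check if disabled, splay, acquire lock) can be modelled in a few lines. This is only a toy sketch of the race as discussed in the channel, not Puppet's actual code: `Host`, `agent_run`, `run_no_puppet` and the `recheck_after_splay` flag are hypothetical names made up for illustration, and the splay sleep is replaced by a callback so the interleaving is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy stand-in for the agent state on one machine (hypothetical)."""
    disabled: bool = False   # models the "disable" lockfile
    locked: bool = False     # models the "agent is running" lock
    log: list = field(default_factory=list)


def run_no_puppet(host: Host) -> None:
    """Model of the wrapper: disable the agent, see no run lock, run the command."""
    host.disabled = True
    # The real wrapper would wait here if a run lock existed.
    # During the splay sleep the agent holds no lock, so there is nothing to wait on.
    assert not host.locked
    host.log.append("command running")


def agent_run(host: Host, during_splay, recheck_after_splay: bool) -> str:
    """Model of one cron-started agent run.

    `during_splay` stands in for whatever else happens on the host
    while the agent is sleeping for its splay interval.
    """
    if host.locked:
        return "skipped: already running"
    if host.disabled:
        return "skipped: disabled"
    during_splay(host)  # splay sleep: no lock is held at this point
    if recheck_after_splay and host.disabled:
        return "skipped: disabled during splay"
    host.locked = True  # the lock is only taken after the splay
    try:
        return "applied catalog"
    finally:
        host.locked = False


if __name__ == "__main__":
    # Ordering as read from the agent code: the run conflicts with the command.
    print(agent_run(Host(), run_no_puppet, recheck_after_splay=False))
    # Same interleaving with a second disable check after the splay.
    print(agent_run(Host(), run_no_puppet, recheck_after_splay=True))
```

With `recheck_after_splay=False` the modelled agent applies its catalog even though the wrapper disabled it mid-sleep, which matches the cp1072 syslog sequence. The alternative floated in the channel, doing the random sleep in the cron wrapper script before the agent starts, would get the same effect by making sure the disable check happens after the delay.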
[22:24:34] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/smokeping/www/smokeping.fcgi],Exec[acme-setup-acme-librenms] [22:24:54] known, brand new [22:25:43] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [22:26:07] (03PS1) 10Dzahn: netmon: disable Letsencrypt on netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/364613 (https://phabricator.wikimedia.org/T166180) [22:29:39] (03CR) 10Dzahn: [C: 032] netmon: disable Letsencrypt on netmon2001 [puppet] - 10https://gerrit.wikimedia.org/r/364613 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn) [22:29:59] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3428189 (10GWicke) [22:30:03] 10Operations, 10RESTBase, 10Services, 10Wikimedia-Site-requests: Index page https://wikimedia.org/api/ is broken / RESTBase not discoverable - https://phabricator.wikimedia.org/T138848#3428191 (10GWicke) [22:30:43] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [22:43:27] (03PS1) 10Dzahn: switch librenms from netmon1002 to netmon1002 [dns] - 10https://gerrit.wikimedia.org/r/364617 (https://phabricator.wikimedia.org/T159756) [22:51:24] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
[22:52:44] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 33.64 seconds [22:53:04] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:53:14] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [22:53:15] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.14 seconds [22:53:24] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.32 seconds [22:53:24] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.79 seconds [22:54:00] 10Operations, 10MediaWiki-API, 10Traffic, 10monitoring, 10Services (watching): Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#3428308 (10GWicke) [22:56:04] 10Operations, 10RESTBase-Cassandra, 10Patch-For-Review, 10Services (watching): setup an alertable threshold for Cassandra heap dumps - https://phabricator.wikimedia.org/T106346#3428315 (10GWicke) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170711T2300). [23:00:06] Niharika: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:15] o/ [23:02:37] Hello [23:02:42] Niharika: you're SWATting? [23:03:12] Dereckson: I'm not. Can you? [23:04:48] I can if you wish, but I need first to wrap up something else. Available in 10 minutes. 
[23:05:17] 10Operations, 10Citoid, 10ContentTranslation, 10ContentTranslation-CXserver, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#3428358 (10GWicke) So at this point basically only cxserver is remaining. Work on that is ongo... [23:05:45] 10Operations, 10Citoid, 10ContentTranslation, 10ContentTranslation-CXserver, and 4 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#3428360 (10GWicke) [23:08:20] Dereckson: That would be great. :) No rush. [23:09:11] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Services (later): Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#3428365 (10GWicke) [23:14:51] * Dereckson is back. [23:15:43] Niharika: there is a merge conflict to solve manually apparently [23:16:16] Dereckson: Gah, gimme a moment. [23:16:17] (apparently because sometimes Gerrit states so but a 3 merge works) [23:16:27] /home/dereckson/dev/mediawiki/operations/mediawiki-config (review/niharika29/t107707) ] git rebase origin/master [23:16:30] Current branch review/niharika29/t107707 is up to date. [23:16:33] as I said [23:18:04] (03PS2) 10Dereckson: Config changes for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362323 (https://phabricator.wikimedia.org/T107707) (owner: 10Niharika29) [23:18:19] (03CR) 10Dereckson: "PS2: rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362323 (https://phabricator.wikimedia.org/T107707) (owner: 10Niharika29) [23:18:51] Niharika: the code still uses wgLoginNotifyEnableOnSuccess? [23:19:16] Dereckson: No, it's removed because we're using the default (True) value. [23:19:25] It's in the extension, I mean [23:19:38] Not in the patch. 
[23:19:40] ok [23:20:07] (03CR) 10Dereckson: [C: 032] Config changes for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362323 (https://phabricator.wikimedia.org/T107707) (owner: 10Niharika29) [23:21:02] (03Merged) 10jenkins-bot: Config changes for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362323 (https://phabricator.wikimedia.org/T107707) (owner: 10Niharika29) [23:21:15] (03CR) 10jenkins-bot: Config changes for LoginNotify [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362323 (https://phabricator.wikimedia.org/T107707) (owner: 10Niharika29) [23:22:35] Niharika: live on mwdebug1002.eqiad.wmnet [23:23:09] Dereckson: Checking. [23:23:36] Dereckson: LGTM. [23:24:08] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3428443 (10Koavf) Thanks. I understand how this may be disappointing to some contributors but I am really concerned about the prospect of deploying someth... [23:24:45] ok, syncing to prod [23:25:17] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Config changes for LoginNotify (T107707) (duration: 00m 47s) [23:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:30] T107707: Login alert when user logs in from new machine - https://phabricator.wikimedia.org/T107707 [23:26:29] We don't have any flood about configuration error, so all looks good Niharika [23:26:40] Dereckson: Thanks! 
[23:26:46] You're welcome [23:32:27] (03PS1) 10Dzahn: rancid: add rsync::quickdatacopy to sync /var/lib/rancid [puppet] - 10https://gerrit.wikimedia.org/r/364620 (https://phabricator.wikimedia.org/T166180) [23:34:23] (03CR) 10Dzahn: [C: 032] "so much nicer with the new "quickdatacopy" abstraction" [puppet] - 10https://gerrit.wikimedia.org/r/364620 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn) [23:44:35] (03PS1) 10Dzahn: rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 [23:45:25] (03PS2) 10Dzahn: rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 [23:45:30] (03CR) 10jerkins-bot: [V: 04-1] rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 (owner: 10Dzahn) [23:46:18] (03CR) 10jerkins-bot: [V: 04-1] rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 (owner: 10Dzahn) [23:47:25] (03PS3) 10Dzahn: rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 [23:54:55] (03PS4) 10Dzahn: rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 [23:55:47] (03CR) 10jerkins-bot: [V: 04-1] rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 (owner: 10Dzahn) [23:56:51] (03PS5) 10Dzahn: rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 [23:58:25] (03CR) 10Dzahn: [C: 032] rsync::quickdatacopy: make auto-sync via cron optional [puppet] - 10https://gerrit.wikimedia.org/r/364621 (owner: 10Dzahn)