[01:09:07] (CR) Dzahn: "@Ema thank you very much! back when i checked i could not seem to find that option that worked :) this is nice! except i noticed on a gane" [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[01:12:53] (CR) Dzahn: "yea, after a while it ends with "ipmi_sdr_cache_create: internal IPMI error" on a VM" [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[01:18:09] (PS5) Dzahn: monitoring/base: add NRPE command to check temperature [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205)
[01:19:16] (CR) jerkins-bot: [V: -1] monitoring/base: add NRPE command to check temperature [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[01:19:24] (PS6) Dzahn: monitoring/base: add NRPE command to check temperature [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205)
[01:24:24] Operations, Monitoring, Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3302713 (Dzahn) @Ema Thank you very much! Back when i checked i somehow could not find the working option :) This is great. Just one thing, we should not be running it on VMs (ganeti...
[01:27:35] (PS7) Dzahn: monitoring/base: add temperature monitoring via NRPE [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205)
[01:28:34] (PS8) Dzahn: monitoring/base: add temperature monitoring via NRPE [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205)
[01:43:16] (CR) Dzahn: "@akosiaris yea, the upper limits can be see when adding "-vvv" to the command, see the "Upper C" column on https://phabricator.wikimedia.o" [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[01:44:49] (CR) Dzahn: [C: 2] "works on 3 different test hosts, if there are any special cases we'll see them. trying it." [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[01:48:47] ^ adds new Icinga temperature checks.. if some are surprising it will be because they are new :p
[01:53:43] ok, it just adds a bunch of UNKNOWN but will look why that is. command worked manually
[02:07:41] (CR) Dzahn: "not quite yet. we have UNKNOWNs because "sudo: no tty present and no askpass program specified". we've been there before.." [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[02:25:17] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 45s)
[02:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:44] (PS1) Dzahn: monitoring/base: add nagios sudo privs for IPMI sensors [puppet] - https://gerrit.wikimedia.org/r/356324 (https://phabricator.wikimedia.org/T125205)
[02:34:25] (PS2) Dzahn: monitoring/base: add nagios sudo privs for IPMI sensors [puppet] - https://gerrit.wikimedia.org/r/356324 (https://phabricator.wikimedia.org/T125205)
[02:35:17] (CR) Dzahn: "needed follow-up https://gerrit.wikimedia.org/r/#/c/356324/" [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[02:35:59] (PS3) Dzahn: monitoring/base: add nagios sudo privs for IPMI sensors [puppet] - https://gerrit.wikimedia.org/r/356324 (https://phabricator.wikimedia.org/T125205)
[02:38:18] (CR) Dzahn: [C: 2] "tested on wtp2018" [puppet] - https://gerrit.wikimedia.org/r/356324 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[02:40:41] PROBLEM - IPMI Temperature on ms-be2028 is CRITICAL: Sensor Type(s) Temperature Status: Critical [System Board 12 29-LOM = Critical]
[02:40:57] ^ that's a good thing, it means it works now and found the first one
[02:41:07] already seeing others OK
[02:41:31] PROBLEM - IPMI Temperature on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:41:43] (was: UNKNOWN - so OK or PROBLEM are both new)
[02:41:51] PROBLEM - IPMI Temperature on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:42:22] RECOVERY - IPMI Temperature on ms-be2020 is OK: Sensor Type(s) Temperature Status: OK
[02:43:21] PROBLEM - IPMI Temperature on ms-be2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:44:11] RECOVERY - IPMI Temperature on ms-be2017 is OK: Sensor Type(s) Temperature Status: OK
[02:44:51] PROBLEM - IPMI Temperature on lvs1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:44:55] (a few are expected due to the number of checks and puppet runs on icinga server and the hosts must both happen)
[02:45:11] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:45:31] PROBLEM - IPMI Temperature on poolcounter1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:45:41] RECOVERY - IPMI Temperature on lvs1004 is OK: Sensor Type(s) Temperature Status: OK
[02:46:11] PROBLEM - IPMI Temperature on aqs1004 is CRITICAL: Sensor Type(s) Temperature Status: Critical [System Board 2 26-LOM = Critical, System Board 2 26-LOM = Critical]
[02:46:11] PROBLEM - IPMI Temperature on db2038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:46:21] RECOVERY - IPMI Temperature on poolcounter1002 is OK: Sensor Type(s) Temperature Status: OK
[02:47:01] RECOVERY - IPMI Temperature on db2038 is OK: Sensor Type(s) Temperature Status: OK
[02:47:54] (reducing spam until they all ran once)
[02:48:07] be back soon
[03:22:45] PROBLEM - IPMI Temperature on silver is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:22:47] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn host in scheduled downtime but new service on host
[03:22:47] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn host in scheduled downtime but new service on host
[03:22:47] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn host in scheduled downtime but new service on host
[03:22:47] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn host in scheduled downtime but new service on host
[03:22:47] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn host in scheduled downtime but new service on host
[03:22:47] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn host in scheduled downtime but new service on host
[03:22:47] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn host in scheduled downtime but new service on host
[03:22:48] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn host in scheduled downtime but new service on host
[03:24:31] ACKNOWLEDGEMENT - IPMI Temperature on db2049 is CRITICAL: Sensor Type(s) Temperature Status: Critical [Power Unit 2 18-VR P2 = Critical, Power Unit 2 18-VR P2 = Critical] daniel_zahn host in scheduled downtime but new service on host
[03:24:31] ACKNOWLEDGEMENT - Check systemd state on ms-be2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn host in scheduled downtime but new service on host
[03:25:01] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:26:01] PROBLEM - IPMI Temperature on labsdb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:27:41] PROBLEM - IPMI Temperature on silver is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:27:51] ACKNOWLEDGEMENT - IPMI Temperature on labsdb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T142807 ?
[03:28:01] PROBLEM - IPMI Temperature on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:29:41] PROBLEM - IPMI Temperature on silver is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:31:31] RECOVERY - IPMI Temperature on silver is OK: Sensor Type(s) Temperature Status: OK
[03:32:11] PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.152 second response time
[03:32:31] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time
[03:33:01] PROBLEM - HHVM rendering on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[03:33:31] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.107 second response time
[03:34:01] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 0.253 second response time
[03:34:11] RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.188 second response time
[03:40:17] Operations, Monitoring, Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3302753 (Dzahn) Works now and is OK on > 1000 machines :) https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=IPMI+Temperature I had to ACK a bunch that were on ho...
[03:42:21] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:44:21] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:44:49] (CR) Dzahn: "now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=IPMI+temperature" [puppet] - https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: Dzahn)
[03:49:21] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:50:05] grumbles (but it's just a few ms-be2*)
[03:51:21] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
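The IPMI Temperature alerts in this log come in two kinds that need different follow-up: NRPE socket timeouts, where the check never returned a reading, and real sensor states such as `[System Board 12 29-LOM = Critical]`, where the same sensor is often listed several times. A minimal sketch of how the two could be told apart when going through a log like this one. The function name `parse_alert`, the returned dictionary layout and the regular expressions are assumptions made for illustration, derived only from the alert lines shown here; they are not part of Icinga or of the check plugin, and only the `Warning` and `Critical` sensor states seen in this log are recognised.

```python
import re

# Shape of the IPMI Temperature alert lines as they appear in this log.
ALERT_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?P<kind>PROBLEM|RECOVERY|ACKNOWLEDGEMENT) - IPMI Temperature on "
    r"(?P<host>\S+) is (?P<state>OK|WARNING|CRITICAL|UNKNOWN): (?P<detail>.*)$"
)
# One "sensor name = status" entry inside the bracketed sensor list.
SENSOR_RE = re.compile(r"([^,\[\]]+?) = (Warning|Critical)")
SEVERITY = {"Warning": 1, "Critical": 2}


def parse_alert(line):
    """Return a dict describing one alert line, or None for any other line."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    detail = match.group("detail")
    # A timeout means the check did not complete; it carries no sensor data.
    timed_out = "CHECK_NRPE: Socket timeout" in detail
    sensors = {}
    bracket = detail.find("[")
    if not timed_out and bracket != -1:
        for name, status in SENSOR_RE.findall(detail[bracket + 1:]):
            name = name.strip()
            # Repeated entries collapse to the worst status seen per sensor.
            if SEVERITY[status] > SEVERITY.get(sensors.get(name), 0):
                sensors[name] = status
    return {
        "time": match.group("time"),
        "kind": match.group("kind"),
        "host": match.group("host"),
        "state": match.group("state"),
        "timed_out": timed_out,
        "sensors": sensors,
    }
```

Fed the log one message at a time, entries with `timed_out` set can be grouped by host to spot machines that time out repeatedly, while entries with a non-empty `sensors` mapping point at hardware worth a closer look.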
[03:51:21] PROBLEM - IPMI Temperature on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:53:11] RECOVERY - IPMI Temperature on ms-be1016 is OK: Sensor Type(s) Temperature Status: OK
[03:54:21] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:55:49] Operations, Monitoring, Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3302768 (Dzahn) special case: cp1049 - "System Board 1 Inlet Temp = Critical" ``` ID | Name | Type | State | Reading | Units | Lower NR | Lower C | Lower N...
[04:00:21] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:03:00] ACKNOWLEDGEMENT - IPMI Temperature on aqs1004 is CRITICAL: Sensor Type(s) Temperature Status: Critical [System Board 2 26-LOM = Critical, System Board 2 26-LOM = Critical] daniel_zahn https://phabricator.wikimedia.org/T125205
[04:03:00] ACKNOWLEDGEMENT - IPMI Temperature on cp1049 is CRITICAL: Sensor Type(s) Temperature Status: Critical [System Board 1 Inlet Temp = Warning, System Board 1 Inlet Temp = Critical, System Board 1 Inlet Temp = Warning] daniel_zahn https://phabricator.wikimedia.org/T125205
[04:03:00] ACKNOWLEDGEMENT - IPMI Temperature on labsdb1003 is CRITICAL: Sensor Type(s) Temperature Status: Critical [Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical, Processor 1 P1_TEMP_SENS = Warning, Processor 1 P1_TEMP_SENS = Critical,
[04:03:00] ACKNOWLEDGEMENT - IPMI Temperature on labsdb1011 is CRITICAL: Sensor Type(s) Temperature Status: Critical [System Board 12 29-LOM = Critical] daniel_zahn https://phabricator.wikimedia.org/T125205
[04:03:00] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T125205
[04:03:00] ACKNOWLEDGEMENT - IPMI Temperature on ms-be2028 is CRITICAL: Sensor Type(s) Temperature Status: Critical [System Board 12 29-LOM = Critical] daniel_zahn https://phabricator.wikimedia.org/T125205
[04:03:00] ACKNOWLEDGEMENT - IPMI Temperature on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. daniel_zahn https://phabricator.wikimedia.org/T125205
[04:03:01] ACKNOWLEDGEMENT - IPMI Temperature on wtp1010 is CRITICAL: Sensor Type(s) Temperature Status: Critical [System Board Inlet Temp = Warning, System Board Inlet Temp = Critical, System Board Inlet Temp = Warning, System Board Inlet Temp = Warning, System Board Inlet Temp = Critical, System Board Inlet Temp = Warning, System Board Inlet Temp = Warning, System Board Inlet Temp = Critical, System Board Inlet Temp = Warning, System Board I
[04:04:21] PROBLEM - IPMI Temperature on ms-be2014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:21:21] PROBLEM - IPMI Temperature on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:11] RECOVERY - IPMI Temperature on labnodepool1001 is OK: Sensor Type(s) Temperature Status: OK
[05:28:47] (CR) Dzahn: [C: -1] "see inline comment. should be version "8" on jessie AND stretch and _not_ install 7 unless it's still trusty." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356241 (https://phabricator.wikimedia.org/T166611) (owner: Paladox)
[05:42:47] (CR) Dzahn: [C: -1] "see inline comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: Paladox)
[05:44:55] (CR) Dzahn: [C: 1] "lgtm (for stretch testing)" [puppet] - https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) (owner: Paladox)
[05:46:16] (CR) Dzahn: [C: 2] ganglia: remove dup define in postgresql check [puppet] - https://gerrit.wikimedia.org/r/356235 (owner: BryanDavis)
[05:50:30] (PS3) Dzahn: ganglia: remove dup define in postgresql check [puppet] - https://gerrit.wikimedia.org/r/356235 (owner: BryanDavis)
[05:53:30] (PS4) Dzahn: ganglia: remove dup define in postgresql check [puppet] - https://gerrit.wikimedia.org/r/356235 (owner: BryanDavis)
[05:56:31] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=302.70 Read Requests/Sec=3340.30 Write Requests/Sec=689.40 KBytes Read/Sec=23649.60 KBytes_Written/Sec=9454.00
[06:04:17] !log Deploy alter table on s3 revision table - db1095 - T166278
[06:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:28] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278
[06:04:40] !log Deploy alter table on s3 revision table - db1078 - T166278
[06:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:30] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/356332 (https://phabricator.wikimedia.org/T166278)
[06:07:31] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=8.10 Read Requests/Sec=0.00 Write Requests/Sec=0.30 KBytes Read/Sec=0.00 KBytes_Written/Sec=13.20
[06:08:03] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/356332 (https://phabricator.wikimedia.org/T166278) (owner: Marostegui)
[06:10:21] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/356332 (https://phabricator.wikimedia.org/T166278) (owner: Marostegui)
[06:12:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1078 - T166278 (duration: 00m 43s)
[06:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:26] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278
[06:19:11] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:26:46] (PS1) Marostegui: db-eqiad.php: Depool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/356333 (https://phabricator.wikimedia.org/T166206)
[06:28:41] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/356333 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[06:29:44] (Merged) jenkins-bot: db-eqiad.php: Depool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/356333 (https://phabricator.wikimedia.org/T166206) (owner: Marostegui)
[06:31:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081, depool db1059 - T166206 (duration: 00m 41s)
[06:31:06] !log Deploy alter table on s4 - db1059 - T166206
[06:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:12] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206
[06:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:09] (PS1) Marostegui: wmnet: Point m3 slave to codfw master [dns] - https://gerrit.wikimedia.org/r/356334 (https://phabricator.wikimedia.org/T160731)
[06:45:20] !log Deploy alter table s3 revision table - dbstore1002 - T166278
[06:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:28] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278
[06:45:30] !log installing sudo security updates
[06:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:11] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:01:03] Operations, ops-eqiad, Analytics-Kanban, DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T166422#3302895 (Marostegui) Open>Resolved This is now back to Optimal ``` root@db1046:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id:...
[07:12:21] PROBLEM - IPMI Temperature on kafka2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:21] RECOVERY - IPMI Temperature on kafka2001 is OK: Sensor Type(s) Temperature Status: OK
[07:15:23] Operations, OTRS: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3302902 (akosiaris) This is happening today, May 31st 2017, 14:00 UTC per the MOTD all these days on ticket.wikimedia.org.
[07:20:44] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3302922 (hashar)
[07:21:41] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (hashar) Per T151996#3300615 the deployment of ElasticSearch will not be migrated from Trebuchet to Scap but instead use Debian packages which is T158560. Now a sub...
[07:27:19] Operations, User-Joe: Sync internal nutcracker package with Debian package - https://phabricator.wikimedia.org/T166038#3283563 (Joe) The only thing we have added to 0.4.1 is https://github.com/wikimedia/operations-debs-nutcracker/commit/37fb9a2b939821c6d704ba09b7d80bcc88961224, which is useful if we rais...
[07:32:11] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]
[07:32:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]
[07:41:01] PROBLEM - IPMI Temperature on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:41:27] (PS2) DCausse: [WIP] Switch this repo to a deb package [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560)
[07:41:51] RECOVERY - IPMI Temperature on ms-be2016 is OK: Sensor Type(s) Temperature Status: OK
[07:46:50] (CR) DCausse: [WIP] Switch this repo to a deb package (2 comments) [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: DCausse)
[07:47:31] !log restart kafka on kafka10[14,22,20] for jvm upgrades
[07:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:02] (PS1) Hashar: beta: keep less mysql bin logs [puppet] - https://gerrit.wikimedia.org/r/356337 (https://phabricator.wikimedia.org/T166060)
[07:50:33] (CR) Marostegui: [C: 1] beta: keep less mysql bin logs [puppet] - https://gerrit.wikimedia.org/r/356337 (https://phabricator.wikimedia.org/T166060) (owner: Hashar)
[07:54:09] (CR) Hashar: [C: 1] "All of that thanks to Manuel who gave me all the instruction to fix it up :-} I have cherry picked the patch on the beta puppet master an" [puppet] - https://gerrit.wikimedia.org/r/356337 (https://phabricator.wikimedia.org/T166060) (owner: Hashar)
[07:55:41] (CR) Marostegui: [C: 2] beta: keep less mysql bin logs [puppet] - https://gerrit.wikimedia.org/r/356337 (https://phabricator.wikimedia.org/T166060) (owner: Hashar)
[08:01:11] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[08:01:20] Operations, Citoid, VisualEditor, Services (blocked), User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3302978 (Mvolz) I've sent a cold e-mail to technical support.
[08:01:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[08:04:01] PROBLEM - Nginx local proxy to apache on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:04:35] checking --^
[08:04:51] RECOVERY - Nginx local proxy to apache on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.413 second response time
[08:05:01] PROBLEM - HHVM rendering on mw1199 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[08:05:19] 3.18 api-appserver
[08:05:31] PROBLEM - Host elastic2014 is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:02] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 74629 bytes in 0.671 second response time
[08:06:41] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 144.15 seconds
[08:07:25] so on mw1198 the hhvm load went up to 64, then apache was unhappy with AH01067: Failed to read FastCGI header etc..
[08:08:16] that's some hardware problem, [08:08:33] lots of MCE and CPU heating errors in syslog [08:08:53] https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1198&from=now-1h&to=now [08:09:03] queued requests for hhvm [08:09:21] RECOVERY - Host elastic2014 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [08:09:54] mw1199 queued for a brief amount of time https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1199 [08:10:53] ah on mw1199 hhvm was restarted 5min ago [08:11:24] (03CR) 10Hashar: "Thanks! Lets remove that safe guard entirely from contint::package_builder , there is already one in the parent role :-}" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [08:12:05] moritzm: (on mw1199 syslog - May 31 08:05:07 mw1199 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV) [08:13:32] elukey: looks very similar on the luasandbox crashes [08:13:50] elukey: looks very similar like the luasandbox crashes [08:14:30] (03CR) 10Hashar: [C: 031] "Straight forward :-}" [puppet] - 10https://gerrit.wikimedia.org/r/356234 (owner: 10BryanDavis) [08:18:17] mw1198 sometimes shows an increased hhvm load that eventually corresponds to queueing (apache workers rising too). This could be more traffic or hhvm slowing down in serving requests and accumulating them [08:18:51] (03CR) 10Hashar: [C: 031] "Very nice thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/356236 (owner: 10BryanDavis) [08:20:58] (03CR) 10Filippo Giunchedi: "> it seems PS1 of this was left on the beta cluster, rather than the" [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [08:21:37] (03PS1) 10Alexandros Kosiaris: Update Templates for 5.0.19 OTRS version [software/otrs] - 10https://gerrit.wikimedia.org/r/356340 (https://phabricator.wikimedia.org/T165284) [08:24:19] !log installing imagemagick security updates on trusty (jessie already fixed) [08:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:39] 06Operations, 15User-Joe: Switch etcd back to eqiad, document switchover procedure - https://phabricator.wikimedia.org/T166552#3303022 (10Joe) [08:26:29] (03PS2) 10Giuseppe Lavagetto: Reduce TTL of etcd conftool entries [dns] - 10https://gerrit.wikimedia.org/r/356135 [08:27:36] <_joe_> akosiaris: I'll start with this ^^ [08:27:49] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update Templates for 5.0.19 OTRS version [software/otrs] - 10https://gerrit.wikimedia.org/r/356340 (https://phabricator.wikimedia.org/T165284) (owner: 10Alexandros Kosiaris) [08:27:52] !log complete linux 4.9 upgrade on Debian ms-be2* machines [08:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:04] 06Operations, 15User-Joe: Switch etcd back to eqiad, document switchover procedure - https://phabricator.wikimedia.org/T166552#3303028 (10Joe) The simple script to set the replication index in codfw before starting replication: ``` import etcd import sys if len(sys.argv) < 3: raise RuntimeError("Usage: {... [08:31:39] (03CR) 10Hashar: [C: 031] "we might consider adding Package[bash-completion] in ::standard. 
Currently it is installed via the scap package, but I think it is worth " [puppet] - 10https://gerrit.wikimedia.org/r/353937 (owner: 10BryanDavis) [08:37:41] PROBLEM - IPMI Temperature on db2041 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:37:59] (03CR) 10Giuseppe Lavagetto: [C: 032] Reduce TTL of etcd conftool entries [dns] - 10https://gerrit.wikimedia.org/r/356135 (owner: 10Giuseppe Lavagetto) [08:38:31] RECOVERY - IPMI Temperature on db2041 is OK: Sensor Type(s) Temperature Status: OK [08:40:21] (03CR) 10jenkins-bot: Compress all project SVG logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356231 (owner: 10Jdlrobson) [08:40:31] PROBLEM - IPMI Temperature on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:21] RECOVERY - IPMI Temperature on labnodepool1001 is OK: Sensor Type(s) Temperature Status: OK [08:41:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356333 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [08:42:06] (03CR) 10jenkins-bot: Make Flow default in all talk namespaces on cawikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356306 (https://phabricator.wikimedia.org/T165497) (owner: 10Catrope) [08:42:11] 06Operations, 15User-Joe: Switch etcd back to eqiad, document switchover procedure - https://phabricator.wikimedia.org/T166552#3303052 (10Joe) [08:42:20] (03PS2) 10Giuseppe Lavagetto: etcd: switch writes to eqiad [dns] - 10https://gerrit.wikimedia.org/r/356136 [08:42:44] (03CR) 10jenkins-bot: Page images can come outside the lead for all projects except Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356225 (https://phabricator.wikimedia.org/T166493) (owner: 10Jdlrobson) [08:43:02] PROBLEM - Apache HTTP on mw1201 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.075 second response time [08:43:11] PROBLEM - HHVM rendering on mw1201 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [08:43:15] (03PS2) 10Giuseppe Lavagetto: etcd: set codfw to read-only too. [puppet] - 10https://gerrit.wikimedia.org/r/356138 [08:43:33] checking mw1201.. [08:44:01] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.112 second response time [08:44:01] PROBLEM - IPMI Temperature on snapshot1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:44:11] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74629 bytes in 0.450 second response time [08:44:19] hhvm restarted a min ago, I bet same problem [08:44:24] (03CR) 10Volans: [C: 031] "LGTM to keep the current behaviour" [puppet] - 10https://gerrit.wikimedia.org/r/356212 (https://phabricator.wikimedia.org/T166203) (owner: 10Alexandros Kosiaris) [08:44:36] May 31 08:43:05 mw1201 systemd[1]: hhvm.service: main process exited, code=killed, status=11/SEGV [08:44:44] moritzm: --^ api server with 2.18 [08:44:47] err 3.18 [08:44:51] RECOVERY - IPMI Temperature on snapshot1005 is OK: Sensor Type(s) Temperature Status: OK [08:45:07] (03CR) 10jenkins-bot: Test wgLogoHD keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [08:45:08] (03CR) 10jenkins-bot: Add Wikipedia wordmark in Serbian/Macedonian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355625 (https://phabricator.wikimedia.org/T165896) (owner: 10Dereckson) [08:45:10] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356332 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [08:45:24] (03CR) 10Filippo Giunchedi: [C: 031] logstash - start using elasticsearch-curator for indices cleanup [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) (owner: 10Gehel) [08:45:54] (03PS6) 10Gehel: logstash - start using elasticsearch-curator 
for indices cleanup [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) [08:46:18] godog: thanks! [08:46:41] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:47:16] (03CR) 10Gehel: [C: 032] logstash - start using elasticsearch-curator for indices cleanup [puppet] - 10https://gerrit.wikimedia.org/r/356063 (https://phabricator.wikimedia.org/T166154) (owner: 10Gehel) [08:47:31] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.129 second response time [08:48:00] (03PS3) 10Giuseppe Lavagetto: etcd: set codfw to read-only too. [puppet] - 10https://gerrit.wikimedia.org/r/356138 [08:48:02] (03PS2) 10Giuseppe Lavagetto: etcd: enable replication eqiad => codfw [puppet] - 10https://gerrit.wikimedia.org/r/356139 [08:48:04] (03PS1) 10Giuseppe Lavagetto: etcd: set eqiad in r-w mode [puppet] - 10https://gerrit.wikimedia.org/r/356341 [08:49:06] elukey: mw1201 segfaulted, can't even get a backtrace there [08:49:20] <_joe_> akosiaris: so when I merge https://gerrit.wikimedia.org/r/356138, we will go in read-only mode; after that we need to disable puppet on all conf* hosts, merge https://gerrit.wikimedia.org/r/#/c/356139, run puppet in eqiad, run the script I prepared on conf2002, run puppet in codfw as well (this will start replica), then merge https://gerrit.wikimedia.org/r/356341 and run puppet in eqiad [08:50:25] this is a good candidate for a "thing" with tasks in the switchdc spinoff :D [08:50:35] <_joe_> yes [08:50:40] <_joe_> I said so in the ticket [08:50:45] <_joe_> I told you in Vienna [08:50:50] <_joe_> ;P [08:50:57] it was not for you [08:50:58] <_joe_> but no pressure man [08:51:01] (03PS1) 10Gehel: logstash - fix typo in template name [puppet] - 10https://gerrit.wikimedia.org/r/356342 (https://phabricator.wikimedia.org/T166154) [08:51:02] I know you know [08:51:03] :D [08:51:11] PROBLEM - puppet last run on logstash1001 is CRITICAL: 
CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:51:21] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:52:11] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 74617 bytes in 0.314 second response time [08:52:37] (03CR) 10Gehel: [C: 032] logstash - fix typo in template name [puppet] - 10https://gerrit.wikimedia.org/r/356342 (https://phabricator.wikimedia.org/T166154) (owner: 10Gehel) [08:54:11] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [08:54:11] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:55:12] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave] [08:55:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave] [08:56:11] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [09:00:41] 06Operations, 15User-Joe: Switch etcd back to eqiad, document switchover procedure - https://phabricator.wikimedia.org/T166552#3303087 (10Joe) Play-by-play: # Merge https://gerrit.wikimedia.org/r/356138 # sudo cumin 'R:class = role::configmaster and *.eqiad.wmnet' 'run-puppet-agent' (begins read-only) # sud... [09:00:56] <_joe_> akosiaris: I'd start, it's the scheduled time [09:01:08] <_joe_> I've just put a play-by-play in the ticket [09:01:28] missing a single quote in point 2 at the end _joe_ ;) [09:01:46] <_joe_> volans: I don't think so?
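An aside on the quoting question raised at 09:01:28: a minimal sketch, in Python, of one way to confirm that a shell command pasted into a ticket has balanced quotes before anyone runs it. It relies on the standard library's `shlex.split`, which raises `ValueError` on an unterminated quote. The helper name `has_balanced_quotes` is hypothetical and is not part of cumin or any tool mentioned in this log; the command string is the one quoted in the play-by-play.

```python
import shlex


def has_balanced_quotes(command: str) -> bool:
    """Return True if `command` tokenizes cleanly under POSIX shell quoting rules."""
    try:
        shlex.split(command)
    except ValueError:
        # shlex raises ValueError for an unterminated quote or a dangling escape.
        return False
    return True


# The cumin invocation as it appears in the play-by-play.
STEP_2 = "sudo cumin 'R:class = role::configmaster and *.eqiad.wmnet' 'run-puppet-agent'"

if __name__ == "__main__":
    print(has_balanced_quotes(STEP_2))       # command as written in the ticket
    print(has_balanced_quotes(STEP_2[:-1]))  # same command with the final quote dropped
```

This only checks tokenization; it says nothing about whether the cumin query itself selects the intended hosts.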
[09:01:54] and 3 fwiw :D [09:02:04] <_joe_> 3 that's right [09:02:06] <_joe_> 2, nope [09:02:11] <_joe_> sudo cumin 'R:class = role::configmaster and *.eqiad.wmnet' 'run-puppet-agent' [09:02:13] sorry I meant 3 from the start [09:02:26] misread the lines :D [09:02:35] <_joe_> fixed :P [09:02:42] <_joe_> so, let's start [09:03:22] <_joe_> meh, there is an error in the first patch, sigh [09:05:57] 06Operations, 10DBA, 10MediaWiki-extensions-SecurePoll, 07Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3300419 (10Marostegui) I have checked all the wikis listed on dblists and the one I have found missing... [09:06:35] 06Operations, 10Citoid, 10VisualEditor, 06Services (blocked), 15User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3303128 (10Mvolz) p:05High>03Normal [09:07:31] <_joe_> ok time to go [09:07:41] (03PS4) 10Giuseppe Lavagetto: etcd: set codfw to read-only too. [puppet] - 10https://gerrit.wikimedia.org/r/356138 [09:07:42] (03PS3) 10Giuseppe Lavagetto: etcd: enable replication eqiad => codfw [puppet] - 10https://gerrit.wikimedia.org/r/356139 [09:07:45] (03PS2) 10Giuseppe Lavagetto: etcd: set eqiad in r-w mode [puppet] - 10https://gerrit.wikimedia.org/r/356341 [09:08:21] (03PS5) 10Giuseppe Lavagetto: etcd: set codfw to read-only too. [puppet] - 10https://gerrit.wikimedia.org/r/356138 [09:08:37] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] etcd: set codfw to read-only too. 
[puppet] - 10https://gerrit.wikimedia.org/r/356138 (owner: 10Giuseppe Lavagetto) [09:09:27] <_joe_> !log etcd in read-only mode for switchover to eqiad [09:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:52] <_joe_> puppet still running in codfw [09:12:21] (03PS4) 10Giuseppe Lavagetto: etcd: enable replication eqiad => codfw [puppet] - 10https://gerrit.wikimedia.org/r/356139 [09:12:41] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] etcd: enable replication eqiad => codfw [puppet] - 10https://gerrit.wikimedia.org/r/356139 (owner: 10Giuseppe Lavagetto) [09:15:03] <_joe_> now I'm waiting for the replica to stop [09:15:18] <_joe_> it takes some time, around 1 minute or so usually [09:15:58] <_joe_> !log etcd replica codfw => eqiad now stopped [09:16:01] PROBLEM - etcdmirror-conftool-codfw-wmnet service on conf1002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-codfw-wmnet is failed [09:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:10] <_joe_> that;s expected [09:16:16] <_joe_> sigh damn async icinga [09:16:45] PROBLEM - Etcd replication lag on conf1002 is CRITICAL: connect to address 10.64.32.180 and port 8000: Connection refused [09:17:23] <_joe_> sorry I forgot to downtime [09:17:25] <_joe_> will do next time [09:17:28] :) [09:17:37] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: switch writes to eqiad [dns] - 10https://gerrit.wikimedia.org/r/356136 (owner: 10Giuseppe Lavagetto) [09:18:21] PROBLEM - Check systemd state on conf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
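For readers following the switchover, the order _joe_ laid out at 08:49:20 can be sketched as an ordered list of steps that stops at the first failure. This is only an illustration under stated assumptions: `Step`, `run_in_order` and `PLAN` are hypothetical names, the actions are placeholders that do nothing, and the real procedure was carried out by hand with Gerrit merges, puppet runs and a script on conf2002.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True on success


def run_in_order(steps: List[Step]) -> List[str]:
    """Run steps sequentially and stop at the first failure.

    Returns the descriptions of the steps that completed.
    """
    done: List[str] = []
    for step in steps:
        if not step.action():
            break
        done.append(step.description)
    return done


def _placeholder() -> bool:
    # Stand-in for a manual action; always reports success.
    return True


# Order taken from the 08:49:20 message; the actions are placeholders.
PLAN = [
    Step("merge 356138 (etcd goes read-only)", _placeholder),
    Step("disable puppet on all conf* hosts", _placeholder),
    Step("merge 356139 (replication eqiad => codfw)", _placeholder),
    Step("run puppet in eqiad", _placeholder),
    Step("run the prepared script on conf2002", _placeholder),
    Step("run puppet in codfw (starts the replica)", _placeholder),
    Step("merge 356341 and run puppet in eqiad (eqiad read-write)", _placeholder),
]

if __name__ == "__main__":
    for description in run_in_order(PLAN):
        print("done:", description)
```

Stopping at the first failed step matters here because every later step assumes etcd is already read-only and that the old replica has stopped.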
[09:19:03] :) [09:19:50] (03CR) 10Alexandros Kosiaris: [C: 032] Force fact stringification in servermon reporter [puppet] - 10https://gerrit.wikimedia.org/r/356212 (https://phabricator.wikimedia.org/T166203) (owner: 10Alexandros Kosiaris) [09:19:55] (03PS3) 10Alexandros Kosiaris: Force fact stringification in servermon reporter [puppet] - 10https://gerrit.wikimedia.org/r/356212 (https://phabricator.wikimedia.org/T166203) [09:19:59] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Force fact stringification in servermon reporter [puppet] - 10https://gerrit.wikimedia.org/r/356212 (https://phabricator.wikimedia.org/T166203) (owner: 10Alexandros Kosiaris) [09:22:04] <_joe_> !log started etcd replica eqiad => codfw [09:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:50] (03PS3) 10Giuseppe Lavagetto: etcd: set eqiad in r-w mode [puppet] - 10https://gerrit.wikimedia.org/r/356341 [09:22:56] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] etcd: set eqiad in r-w mode [puppet] - 10https://gerrit.wikimedia.org/r/356341 (owner: 10Giuseppe Lavagetto) [09:24:05] (03PS1) 10Marostegui: db-eqiad.php: Repool db1078, depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356345 (https://phabricator.wikimedia.org/T166278) [09:24:25] <_joe_> !log etcd in eqiad in read-write mode [09:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:03] <_joe_> verified writes go to eqiad and they work [09:25:12] yay [09:25:13] nice [09:25:17] \o/ [09:25:25] <_joe_> taking a small pause now [09:25:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1078, depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356345 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [09:25:30] <_joe_> will finish up afterwards [09:26:46] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1078, depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356345 
(https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [09:26:55] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1078, depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356345 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [09:27:36] (03PS1) 10Alexandros Kosiaris: Change the change for the future parser invocation [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356346 [09:27:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1078, depool db1077 - T166278 (duration: 00m 42s) [09:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:07] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [09:28:10] akosiaris: "Change the change" sublime! [09:28:31] lol [09:28:44] nice PEBKAC [09:29:37] (03PS2) 10Alexandros Kosiaris: Toggle future parser based on change directory [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356346 [09:29:53] (03CR) 10Jcrespo: [C: 04-1] "I would use the master (we used to have the master for this -I was the one that asked to use the slave-, and it wasn't a big deal). 
Cross-" [dns] - 10https://gerrit.wikimedia.org/r/356334 (https://phabricator.wikimedia.org/T160731) (owner: 10Marostegui) [09:30:15] !log Deploy alter table on s3 revision table - db1078 - https://phabricator.wikimedia.org/T166278 [09:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:31] (03PS1) 10Muehlenhoff: Add ferm service for rpc.mountd on labstore [puppet] - 10https://gerrit.wikimedia.org/r/356347 (https://phabricator.wikimedia.org/T165136) [09:30:49] (03CR) 10Alexandros Kosiaris: [C: 032] Toggle future parser based on change directory [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356346 (owner: 10Alexandros Kosiaris) [09:31:52] (03PS2) 10Marostegui: wmnet: Point m3 slave to eqiad master [dns] - 10https://gerrit.wikimedia.org/r/356334 (https://phabricator.wikimedia.org/T160731) [09:32:39] volans: the reporter seems to work fine [09:32:43] no probs so far [09:32:54] great! [09:32:55] I 'll clean a node fully in the db and see what happens [09:33:00] ack [09:33:12] (03PS1) 10Alexandros Kosiaris: puppet-compiler: Upgrade to version 0.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/356348 [09:33:34] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppet-compiler: Upgrade to version 0.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/356348 (owner: 10Alexandros Kosiaris) [09:33:38] (03PS2) 10Alexandros Kosiaris: puppet-compiler: Upgrade to version 0.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/356348 [09:33:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppet-compiler: Upgrade to version 0.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/356348 (owner: 10Alexandros Kosiaris) [09:35:35] (03CR) 10Jcrespo: [C: 031] "ok. 
Can you have a quick check that grants are all the same for both the proxy, the master and the slave (it should be already, but better" [dns] - 10https://gerrit.wikimedia.org/r/356334 (https://phabricator.wikimedia.org/T160731) (owner: 10Marostegui) [09:35:41] (03PS1) 10Hashar: phpunit: replace deprecated strict=true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356349 [09:37:15] (03PS2) 10Hashar: phpunit: replace deprecated strict=true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356349 [09:38:31] (03PS2) 10Giuseppe Lavagetto: Restore 5M TTL for conftool SRV records [dns] - 10https://gerrit.wikimedia.org/r/356137 [09:39:48] (03CR) 10Giuseppe Lavagetto: [C: 032] Restore 5M TTL for conftool SRV records [dns] - 10https://gerrit.wikimedia.org/r/356137 (owner: 10Giuseppe Lavagetto) [09:40:21] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [09:40:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [09:42:15] 06Operations, 15User-Joe: Switch etcd back to eqiad, document switchover procedure - https://phabricator.wikimedia.org/T166552#3303157 (10Joe) All done, the play-by-play is how I executed the switchover. I'll write up some more documentation, and close the ticket as resolved. [09:42:34] _joe_: when you have a minute, worth to add the play-by-play list into a wikitech page for now until we have the script ;) [09:43:00] <_joe_> "I'll write up some more documentation, and close the ticket as resolved." [09:43:09] <_joe_> one line above your suggestion ;) [09:43:20] lol [09:45:29] (03PS1) 10Ema: check_ipmi_temp: bump check/retry intervals and timeout [puppet] - 10https://gerrit.wikimedia.org/r/356350 (https://phabricator.wikimedia.org/T125205) [09:48:50] (03CR) 10Volans: [C: 031] "LGTM, surely an improvement. 
I'd like to have alex/faidon input too on which could be a "good" frequency that doesn't bother the BMC. (if " [puppet] - 10https://gerrit.wikimedia.org/r/356350 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [09:50:14] volans: ok test looks fine. servermon reporter change was a noop. [09:50:24] 06Operations, 10Analytics, 15User-Elukey: Investigate recent Kafka Burrow alarms for EventLogging - https://phabricator.wikimedia.org/T160886#3303179 (10elukey) 05Open>03Resolved a:03elukey Didn't re-occur and after a chat with Andrew we didn't find any good root cause. Inclined to close this issue as... [09:50:30] will redo the test with stringify_facts = false now [09:50:58] 06Operations: debdeploy-minion fails to install on stretch - https://phabricator.wikimedia.org/T166646#3303182 (10fgiunchedi) [09:51:02] akosiaris: ok, great [09:51:27] (03PS3) 10Marostegui: wmnet: Point m3 slave to eqiad master [dns] - 10https://gerrit.wikimedia.org/r/356334 (https://phabricator.wikimedia.org/T160731) [09:51:52] (03PS2) 10Alexandros Kosiaris: Test the future parser in puppet compiler [puppet] - 10https://gerrit.wikimedia.org/r/322898 [09:52:35] volans: and fine with stringify_facts = false as well. All done on servermon part [09:53:02] \o/ so technically we could merge the stringify false in prod today :) [09:53:33] <_joe_> akosiaris: are you making a PCC run? [09:54:06] yeah [09:54:15] starting small first ofc [09:54:18] 5-6 hosts [09:54:54] so, what do you suggest, disable puppet on 5~6 hosts and do it manually? [09:55:08] * volans hates puppet for this kind of deploys [09:55:11] (03CR) 10Marostegui: [C: 032] "Grants are the same indeed" [dns] - 10https://gerrit.wikimedia.org/r/356334 (https://phabricator.wikimedia.org/T160731) (owner: 10Marostegui) [09:55:14] volans: no that was an answer to _joe_ [09:56:06] oops :) [09:57:52] volans: with all the testing we 've done I think we can enable fleet wide [09:58:05] but if you are feeling cautious... do it per DC ? 
[09:58:31] disabling puppet in all and re-enabling it dc by dc? [09:58:40] yeah sounds good enough [09:59:32] (03CR) 10Elukey: [C: 031] Add Druid hosts to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356189 (owner: 10Muehlenhoff) [09:59:38] ok [10:01:44] <_joe_> j-f-d-i [10:01:50] heh [10:02:21] anything else we might have forgot that use facts? :D [10:03:48] (03PS1) 10Volans: admin: remove my old SSH key (volans) [puppet] - 10https://gerrit.wikimedia.org/r/356353 [10:06:12] (03CR) 10Alexandros Kosiaris: [C: 031] "BMCs are generally a pain point but I think this looks for now. 30 mins doesn't sound too bad to allow us to catch a temperature increase " [puppet] - 10https://gerrit.wikimedia.org/r/356350 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [10:06:14] (03PS2) 10Volans: admin: remove my old SSH key (volans) [puppet] - 10https://gerrit.wikimedia.org/r/356353 [10:07:32] (03CR) 10Volans: [C: 032] admin: remove my old SSH key (volans) [puppet] - 10https://gerrit.wikimedia.org/r/356353 (owner: 10Volans) [10:08:23] (03PS2) 10Ema: check_ipmi_temp: bump check/retry intervals and timeout [puppet] - 10https://gerrit.wikimedia.org/r/356350 (https://phabricator.wikimedia.org/T125205) [10:08:30] (03CR) 10Ema: [V: 032 C: 032] check_ipmi_temp: bump check/retry intervals and timeout [puppet] - 10https://gerrit.wikimedia.org/r/356350 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [10:15:45] 06Operations: debdeploy-minion fails to install on stretch - https://phabricator.wikimedia.org/T166646#3303182 (10MoritzMuehlenhoff) We can't easily circumvent the Breaks:, as we'd need to fork openssl internally, which we really should avoid. I've tested debdeploy to work fine with the 2016.11 salt-minion shipp... 
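The per-DC rollout agreed on at 09:58 (disable puppet everywhere, then re-enable it one datacenter at a time) depends on grouping hosts by site. A minimal sketch, assuming hosts are named `host.dc.domain` as in the `*.eqiad.wmnet` pattern from the cumin query earlier in the log; `group_by_datacenter` is a hypothetical helper and the FQDNs in the example are illustrative, not an inventory.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_datacenter(fqdns: Iterable[str]) -> Dict[str, List[str]]:
    """Group hosts by the label after the hostname.

    Example: 'conf1002.eqiad.wmnet' is filed under 'eqiad'.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for fqdn in fqdns:
        labels = fqdn.split(".")
        if len(labels) < 3:
            raise ValueError(f"expected host.dc.domain, got {fqdn!r}")
        groups[labels[1]].append(fqdn)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative host names only.
    hosts = [
        "conf1002.eqiad.wmnet",
        "conf2002.codfw.wmnet",
        "mw1201.eqiad.wmnet",
    ]
    for dc, members in group_by_datacenter(hosts).items():
        print(dc, members)
```

Each group could then be re-enabled and checked before moving on to the next site, which is the cautious option akosiaris suggested.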
[10:16:46] 06Operations: salt-minion fails to install on stretch - https://phabricator.wikimedia.org/T166646#3303215 (10MoritzMuehlenhoff) [10:17:46] moritzm: agreed ^ I'll remove salt from stretch-wikimedia [10:20:19] godog: can you wait a few mins? I'm currently upgrading reprepro, that'll be a nice confirmation that all works fine with 5.1 [10:20:35] moritzm: for sure, ping me when done [10:20:57] we need some changes for stretch builds, like dbgsym and .buildinfo files [10:24:23] uh? [10:24:34] we've been over this [10:24:46] when I first installed a stretch system I went with stretch's salt-minion [10:24:55] there were a couple of commits needed in ops/puppet that I did back then [10:24:58] after that it just worked [10:25:17] yes, but then openssl introduced a Breaks: on < 2016 salt-minion versions [10:25:24] not sure why we're still trying to run salt 2014.whatever on stretch? [10:25:39] just go with stretch's stock salt? [10:25:43] https://packages.qa.debian.org/o/openssl/news/20161121T215011Z.html [10:26:05] yeah that's the plan, remove salt from stretch-wikimedia and that's it [10:26:08] why is it there in the first place? [10:26:10] I didn't add it [10:26:11] yeah, that's what I tested and why godog is removing the 2014 build from stretch-wikimedia :-) [10:26:17] cf. https://phabricator.wikimedia.org/T164595 [10:26:40] PROBLEM - puppet last run on ms-be2036 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 13 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdb3],Exec[mkfs-/dev/sdc1] [10:26:50] I've patched our stuff to work with salt 2016.11.1+ds-1 as shipped by stretch [10:27:18] cde1a1508c55f93baa79fe1d209d87c5988a6d99 and 2b58b1b92f114f1471f80742ef4626c164376d5f [10:28:25] not sure who/why salt/2014 was added to stretch-wikimedia, maybe Andrew when working on the stretch images for labs [10:29:40] !log remove salt-minion salt-common from stretch-wikimedia - T166646 [10:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:51] T166646: salt-minion fails to install on stretch - https://phabricator.wikimedia.org/T166646 [10:31:52] 06Operations: salt-minion fails to install on stretch - https://phabricator.wikimedia.org/T166646#3303277 (10fgiunchedi) 05Open>03Invalid Note that salt was in `stretch-wikimedia` by mistake as @faidon pointed out, see also T164595 [10:32:06] "nothing to see here" etc [10:34:33] (03CR) 10Elukey: [WIP] Add the eventlogging_cleaner script and base package (037 comments) [software/analytics-eventlogging-maintenance] - 10https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [10:35:29] !log updated reprepro on install1002 to 5.1.1 from backports (for support of dbgsym and buildinfo files) [10:35:34] is stretch-wikimedia available for upload already? [10:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:04] jynus: yeah [10:36:28] I haven't finished my packages, but I would not mind doing some -1 testing [10:40:46] (03PS1) 10Alexandros Kosiaris: Upgrade puppet-compiler to version 0.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/356363 [10:41:08] 06Operations, 10Monitoring, 13Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3303287 (10ema) The check seems to be OK at the moment except for a few hosts where the BMC just doesn't seem to want to behave properly (eg: cp3006, db1020). 
For the record, an intere... [10:41:17] (03PS2) 10Alexandros Kosiaris: Upgrade puppet-compiler to version 0.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/356363 [10:41:22] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Upgrade puppet-compiler to version 0.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/356363 (owner: 10Alexandros Kosiaris) [10:42:18] jynus: yes [10:54:48] (03PS1) 10Alexandros Kosiaris: puppet_compiler: Install ruby-rgen [puppet] - 10https://gerrit.wikimedia.org/r/356364 [10:55:03] (03CR) 10jerkins-bot: [V: 04-1] puppet_compiler: Install ruby-rgen [puppet] - 10https://gerrit.wikimedia.org/r/356364 (owner: 10Alexandros Kosiaris) [10:58:46] (03PS1) 10Volans: Icinga: improve raid handler message [puppet] - 10https://gerrit.wikimedia.org/r/356365 (https://phabricator.wikimedia.org/T166519) [10:59:58] !log preparing for backup and reimage to jessie of db2044 [11:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:13] <_joe_> akosiaris: no need for that, you just need to update puppet on the compiler [11:03:55] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Newer puppet-common packages require this package already." [puppet] - 10https://gerrit.wikimedia.org/r/356364 (owner: 10Alexandros Kosiaris) [11:04:43] moritzm: not install2002? [11:05:05] (03CR) 10Jcrespo: [C: 031] "Not tested" [puppet] - 10https://gerrit.wikimedia.org/r/356365 (https://phabricator.wikimedia.org/T166519) (owner: 10Volans) [11:06:39] paravoid: I'll puppetise the change once I've verified that 5.1.1 works fine [11:06:48] ok :) [11:14:32] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3303337 (10BBlack) We can get broader averages by dividing the values seen in the aggregate client status code... 
[11:17:24] 06Operations: ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3303341 (10MoritzMuehlenhoff) [11:20:05] 06Operations: ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3303341 (10jcrespo) Should this be reported #upstream and fixed there or is this a WMF-only problem? [11:23:55] 06Operations: ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3303357 (10MoritzMuehlenhoff) It's a Debian-specific problem,not a WMF-specific one. I'm currently writing up a bug report for the Debian BTS. [11:34:46] 06Operations, 10OCG-General, 10Reading-Community-Engagement, 06Reading-Web-Backlog, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2799650 (10ovasileva) exploratory testing of PDF creation OCG vs Electron: https://phabricator.w... [11:38:11] (03PS5) 10Paladox: contint: Fix stretch support in package_builder [puppet] - 10https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) [11:38:31] (03CR) 10Paladox: contint: Fix stretch support in package_builder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [11:42:07] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#1955371 (10Gilles) Would the use of xkey help here? It sounds like a single user action currently generates se... [11:45:39] 06Operations: ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3303341 (10faidon) Ouch. I would argue that we should probably get rid of those @resolve calls in most (if not all) of those cases, as they are problematic in general: they can often result into unpredictable behavior, as well as DN... 
[11:49:34] (03PS6) 10Paladox: jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [11:50:18] (03CR) 10Paladox: jenkins: Install java 8 on stretch and jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [11:50:23] (03PS7) 10Paladox: jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [11:51:51] (03CR) 10jerkins-bot: [V: 04-1] jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [11:54:09] jouncebot: next [11:54:09] In 1 hour(s) and 5 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T1300) [11:57:09] (03CR) 10Volans: Puppet: disable stringified facts in prod [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [11:57:14] (03PS4) 10Volans: Puppet: disable stringified facts in prod [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) [11:57:15] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3303470 (10faidon) What's the status of this? [11:57:41] paravoid: AFAIK waiting an SSD ^^^ [11:58:28] paladox: https://gerrit.wikimedia.org/r/#/c/356243/7/modules/jenkins/manifests/slave/requisites.pp is not even close to being syntactically correct puppet code [11:58:34] paladox: and it's at PS7 [11:59:24] paladox: with the rate and quality of patches you're pushing, you're probably doing more damage than good [11:59:32] Yep, sorry. I was following how dzahn said it. 
But i tryed doing a google search for pseudo code example in puppet but coulden't find any. Then tryed a github search on the wikimedia puppet repo for if <> then [11:59:45] paladox: so either start paying *a lot* more attention, or stop pushing patches and wasting our time please [12:00:24] paravoid See the comment here https://gerrit.wikimedia.org/r/#/c/356243/5/modules/jenkins/manifests/slave/requisites.pp@10 which i was following [12:00:50] stop trying to do stuff by trying to follow along comments and random github/stackoverflow searches [12:00:59] go back and read a manual or a book or something [12:01:22] and start from a smaller project that you can fully understand [12:02:09] the whole "throw stuff against the wall and see if it sticks" approach that you've been using isn't working well [12:02:16] it's causing a lot of noise and wastes our time [12:02:51] (03CR) 10Volans: [C: 032] Puppet: disable stringified facts in prod [puppet] - 10https://gerrit.wikimedia.org/r/356062 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans) [12:03:15] I do test some of my changes. though the one i did up there i didnt test. [12:03:21] it's not just about testing [12:03:39] it's about understanding e.g. how the language of what you're writing works [12:03:54] ok [12:04:03] !log merged stringify_facts=false for production hosts T166372 [12:04:14] this change above is three lines of code, two of which are entirely syntactically wrong code [12:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:14] T166372: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372 [12:05:58] Yep, i have a change but wont publish it as you told me not to spam (which im not), i also doint mean to waste anybodys time but im trying to fix something but looks like no one cares and thinks i keep spamming. 
[12:06:37] this isn't about "noone caring", it's about you and your approach to submitting patches [12:08:46] ok [12:08:53] 06Operations: ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3303575 (10MoritzMuehlenhoff) Reported in the Debian BTS as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=863802 [12:09:06] seriously, I say this with 100% good intentions [12:09:17] it seems like you're spending a lot of time working on wikimedia-related stuff [12:09:24] and some of them are useful projects as well [12:10:03] but you have a very superficial understanding (if at all) and the signal-to-noise ratio of your contributions is abysmal [12:10:47] I think you'd be much better off if you tried to understand how things work first (both from a language perspective and from our code) [12:10:56] find an area, learn it well, start contributing and build up from there [12:10:56] ok [12:11:16] !log Disable v6 BGP session with Init7 in knams because of routing loop on their network [12:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:25] rather than just throwing whatever you found on SO/GH/Google to Gerrit and hoping someone will just tell you if it's a good idea or not [12:14:29] Ok. I did test all the stretch changes for ci. [12:15:03] He only said it was best to not install java 8 by default. [12:17:04] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3303614 (10Gehel) Waiting for a new SSD. Then reimage, test and back in the cluster if everything looks fine. [12:19:11] (03CR) 10Hashar: "I have missed an issue with the way globals are restored. They would left behind the ones that have a null value due to isset(). 
Fix is i" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 (owner: 10Hashar) [12:19:16] 06Operations, 10netops: Init7 routing loop - https://phabricator.wikimedia.org/T166663#3303641 (10ayounsi) [12:22:43] !log cp1008: upgrade to jessie 8.8 point release T164703 [12:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:52] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703 [12:28:21] (03PS3) 10Hashar: phpunit: replace deprecated strict=true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356349 [12:28:23] (03PS1) 10Hashar: test: be strict regarding globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 [12:31:26] (03PS1) 10Marostegui: db-eqiad.php: Repool db1077, depool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356375 (https://phabricator.wikimedia.org/T166278) [12:33:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1077, depool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356375 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [12:34:05] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1077, depool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356375 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [12:35:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1077, depool db1035 - T166278 (duration: 00m 41s) [12:35:19] !log Deploy alter table on s3 revision table - db1035 - https://phabricator.wikimedia.org/T166278 [12:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:24] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [12:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:49] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1077, depool db1035 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356375 
(https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [12:36:52] !log cache_ulsfo: upgrade to jessie 8.8 point release T164703 [12:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:01] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703 [12:40:36] (03PS2) 10Hashar: test: be strict regarding globals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 [12:45:36] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [12:45:53] * volans looking [12:46:09] that's from ema's 8.8 rollout [12:46:36] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:47:08] I assumed so, but beacause of my merge I'm more suspicious :D [12:56:36] PROBLEM - HP RAID on ms-be1032 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [12:56:56] PROBLEM - HP RAID on ms-be1038 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [12:57:15] 06Operations, 15User-Joe: Switch etcd back to eqiad, document switchover procedure - https://phabricator.wikimedia.org/T166552#3303763 (10Joe) 05Open>03Resolved [12:57:19] 06Operations, 07Upstream: ferm broken in stretch - https://phabricator.wikimedia.org/T166653#3303764 (10jcrespo) [12:59:42] jouncebot: refresh [12:59:43] I refreshed my knowledge about deployments. [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T1300). Please do the needful. [13:00:26] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:00:26] PROBLEM - HP RAID on ms-be1030 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[13:01:46] PROBLEM - HP RAID on ms-be1029 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:01:46] PROBLEM - HP RAID on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:01:56] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:02:36] PROBLEM - HP RAID on ms-be1028 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:02:36] PROBLEM - HP RAID on ms-be1031 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:02:40] godog: swift overloaded? [13:02:56] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:07:46] (03CR) 10Mark Bergsma: [C: 031] "Let's get rid of the eval()s everywhere later." [debs/pybal] - 10https://gerrit.wikimedia.org/r/354680 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [13:08:00] !log uploaded zookeeper 3.4.5+dfsg-2+deb8u2 to apt.wikimedia.org [13:08:06] !log Stop MySQL on db1048 and shutdown the host for maintenance - T160731 [13:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:17] T160731: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731 [13:09:04] I guess I'm going to hijack the SWAT window for some trivial config changes [13:09:51] (03PS2) 10Hoo man: Log "api-readonly" errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) [13:10:04] !log cache_esams: upgrade to jessie 8.8 point release T164703 [13:10:08] hoo: hold on. 
Swift has some issues [13:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:13] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703 [13:10:18] hashar: No worries [13:10:18] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3303799 (10Marostegui) @Cmjohnson db1048 is now down and ready for you to swap the BBU Thanks! [13:11:36] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [13:11:57] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [13:12:11] that is the passive slave going down, right, marostegui? [13:12:26] jynus: yes [13:12:27] that is it [13:12:47] let's add an expiring comment so paravoid doesn't get too worried :-) [13:13:17] hehe I am doing it now [13:13:32] hashar: Do you have an ETA? My things are not important, I can just do them tomorrow [13:13:42] oh, I was doing it, let's do like ethernet, and wait a random time before retrying [13:14:10] volans: godog: the HP RAID issues for ms-be10* hosts started around 7am this morning apparently [13:14:35] (03PS1) 10Faidon Liambotis: Disable temperature monitoring via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/356378 (https://phabricator.wikimedia.org/T125205) [13:14:51] volans: godog there was a scheduled downtime for the hosts apparently. 
The window has expired and thus alarms are being triggered [13:14:55] (03PS1) 10Giuseppe Lavagetto: Use directory manifests when enabling the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356379 [13:15:02] hoo: Swift looks fine so you can hijack the swat window :) [13:15:03] <_joe_> akosiaris: ^^ [13:15:17] hashar: Thanks [13:15:25] (03CR) 10Faidon Liambotis: [C: 032] Disable temperature monitoring via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/356378 (https://phabricator.wikimedia.org/T125205) (owner: 10Faidon Liambotis) [13:15:45] (03CR) 10Jcrespo: [C: 031] "I am ok with this- this certainly needs to happen and will be very useful, but we need to fix several things first (both on the check itse" [puppet] - 10https://gerrit.wikimedia.org/r/356378 (https://phabricator.wikimedia.org/T125205) (owner: 10Faidon Liambotis) [13:15:50] !log cache_codfw: upgrade to jessie 8.8 point release T164703 [13:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:57] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703 [13:16:28] jynus: yeah, agreed [13:16:45] (03CR) 10Hoo man: [C: 032] Log "api-readonly" errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) (owner: 10Hoo man) [13:16:49] that was my justification of why my raid check was so awful and opt-in [13:16:54] they are very difficult to get right [13:17:13] so I did the smallest thing that was ok to continue [13:17:32] knowing we have to continue working on it [13:17:57] but I love there is a recent spike of interest on this [13:18:01] (03Merged) 10jenkins-bot: Log "api-readonly" errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) (owner: 10Hoo man) [13:18:10] (03CR) 10jenkins-bot: Log "api-readonly" errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354246 
(https://phabricator.wikimedia.org/T164191) (owner: 10Hoo man) [13:19:58] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Log "api-readonly" errors (T164191, T123867) (duration: 00m 43s) [13:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:07] T164191: Tired of APIError: readonly - https://phabricator.wikimedia.org/T164191 [13:20:08] T123867: Repeated reports of wikidatawiki (s5) API going read only - https://phabricator.wikimedia.org/T123867 [13:21:06] !log cache_eqiad: upgrade to jessie 8.8 point release T164703 [13:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:14] T164703: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703 [13:22:13] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3303864 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:22:17] (03PS2) 10Hoo man: WikibaseClient: Don't persist Statement usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355101 (https://phabricator.wikimedia.org/T151717) [13:22:43] (03CR) 10Hoo man: "Rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355101 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [13:23:27] (03CR) 10Hoo man: [C: 032] WikibaseClient: Don't persist Statement usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355101 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [13:24:36] (03Merged) 10jenkins-bot: WikibaseClient: Don't persist Statement usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355101 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [13:25:49] (03CR) 10jenkins-bot: WikibaseClient: Don't persist Statement usages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355101 (https://phabricator.wikimedia.org/T151717) (owner: 10Hoo man) [13:26:36] !log hoo@tin Synchronized wmf-config/Wikibase-production.php: WikibaseClient: Don't persist 
Statement usages (T151717) (duration: 00m 41s) [13:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:45] T151717: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717 [13:27:31] I'm done [13:29:39] :) [13:31:24] (03PS1) 10Muehlenhoff: Use reprepro from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/356381 (https://phabricator.wikimedia.org/T158583) [13:31:26] (03PS1) 10Gilles: Add Navigation Timing alerts to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/356382 (https://phabricator.wikimedia.org/T153169) [13:31:52] (03CR) 10Gehel: [C: 04-1] "Looks good! Some minor issues, see comments inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga) [13:32:58] (03CR) 10jerkins-bot: [V: 04-1] Use reprepro from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/356381 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [13:33:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Actually hold on this. 
I am thinking that instead of .future a better approach might be a .compile_flags file that allows arbitrary flags " [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356379 (owner: 10Giuseppe Lavagetto) [13:33:20] (03PS1) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [13:34:54] (03PS2) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [13:35:38] (03PS1) 10Gehel: logstash - forcemerge to 10 segments instead of 1 [puppet] - 10https://gerrit.wikimedia.org/r/356384 (https://phabricator.wikimedia.org/T166154) [13:36:17] (03PS2) 10Muehlenhoff: Use reprepro from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/356381 (https://phabricator.wikimedia.org/T158583) [13:36:27] (03CR) 10jerkins-bot: [V: 04-1] role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [13:37:03] weird I ran pep8 before sending [13:42:47] yeah I've seen pep8 mismatches lately too. I think it's just that python/pep versions are different on my laptop than in CI [13:42:48] <_joe_> akosiaris: yeah that'd be cleaner [13:43:02] (or the config of them?) [13:43:15] _joe_: configs = map(lambda x: "--{}".format(x.lstrip('-') , configs) [13:43:21] how does that look ? [13:43:26] too ugly ? [13:43:28] tox -e pep8 -r [13:43:36] that will rebuild the venv [13:43:45] <_joe_> akosiaris: that's not ruby, unorite? [13:43:50] configs comes from configs.readlines() [13:43:53] ruby ? [13:43:58] what are you talking about ? [13:44:06] it's your project.. you wrote it in python [13:44:08] <_joe_> yeah it LOOKS like Ruby, though [13:45:04] <_joe_> (also, re-assigning configs to itself with map, does it work? 
I'd say yes but not sure) [13:45:47] elukey: I got the same failures locally. So you must be using a different pep8/flake8 version [13:45:54] _joe_: hove you forgotten your programming skills ? [13:45:56] have* [13:45:58] elukey: the way to run it with the same version CI uses is : tox -e pep8 [13:46:10] _joe_: ETOOMUCHPUPPET ? [13:46:12] :P [13:46:21] <_joe_> akosiaris: I was joking it looks like ruby, while being valid python [13:46:51] well in ruby using a lambda is actually not happening often [13:46:59] you got code blocks or Procs for that [13:47:05] code blocks being the usual choice [13:47:23] that's one things that I really like about ruby [13:47:58] (03CR) 10Gehel: "I need to play a bit more with this, but already 2 comments..." (032 comments) [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [13:48:00] <_joe_> you know deep down those code blocks act as lambdas do, more or less, right? [13:48:06] <_joe_> anyways [13:48:18] <_joe_> yeah it's ok but I'd have done something neater :D [13:48:20] hashar: thanks! 
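A minimal sketch of the snippet discussed above, assuming the flags are read one per line from a file, as the configs.readlines() remark suggests. The one-liner pasted in the channel is missing the closing parenthesis of .format(), which matches the "Fix wrong parenthesis typo in map()" patch later in this log. The helper name normalize_flags, the sample flag values and the example-command placeholder are hypothetical and are not taken from puppet-compiler.

```python
def normalize_flags(lines):
    """Return each non-empty line as a single '--flag' option.

    Hypothetical helper: leading dashes are stripped first, so both
    'parser=future' and '--parser=future' end up as '--parser=future'.
    """
    stripped = (line.strip() for line in lines)
    return ["--{}".format(flag.lstrip("-")) for flag in stripped if flag]


# The one-liner from the chat with the parenthesis closed. Re-assigning
# configs to the result of map() over itself is fine; in Python 3, map()
# returns a lazy iterator, so it is wrapped in list() here.
configs = ["--parser=future\n", "strict_variables\n"]
configs = list(map(lambda x: "--{}".format(x.strip().lstrip("-")), configs))

# extend() adds each flag as its own argument, whereas append() would nest
# the whole list as one element (placeholder command, not the real one).
cmd = ["example-command"]
cmd.extend(configs)
```

The list comprehension also drops blank lines, which the bare map() version does not; whether the real file can contain blank lines is an assumption.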
[13:48:26] actually lambdas are Procs in ruby [13:48:39] lambda.class [13:48:39] => Proc [13:48:56] (03PS3) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [13:49:03] uploading change now [13:51:46] (03PS1) 10Alexandros Kosiaris: Append 'src' to change_dir for future parser detection [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356385 [13:51:48] (03PS1) 10Alexandros Kosiaris: Allow arbitrary puppet configuration items in PCC [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356386 [13:52:17] (03CR) 10jerkins-bot: [V: 04-1] Allow arbitrary puppet configuration items in PCC [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356386 (owner: 10Alexandros Kosiaris) [13:52:24] damn you jenkins [13:53:31] (03PS3) 10Muehlenhoff: Use reprepro from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/356381 (https://phabricator.wikimedia.org/T158583) [13:55:30] (03CR) 10Hashar: [C: 031] contint: Fix stretch support in package_builder [puppet] - 10https://gerrit.wikimedia.org/r/356237 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [13:55:33] (03PS2) 10Alexandros Kosiaris: Allow arbitrary puppet configuration items in PCC [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356386 [13:55:48] (03CR) 10Alexandros Kosiaris: [C: 032] Append 'src' to change_dir for future parser detection [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356385 (owner: 10Alexandros Kosiaris) [13:56:51] (03CR) 10Muehlenhoff: [C: 032] Use reprepro from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/356381 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [13:58:54] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 07Technical-Debt: Determine if benefactorevents.wikimedia.org should be hosted on the production cluster or still on Microsoft Azure - 
https://phabricator.wikimedia.org/T166240#3303995 (10Jgreen) >>! In T166240#3301898, @DStrine wrote: > @Derecks... [13:58:58] (03Abandoned) 10Alexandros Kosiaris: puppet_compiler: Install ruby-rgen [puppet] - 10https://gerrit.wikimedia.org/r/356364 (owner: 10Alexandros Kosiaris) [14:01:18] jouncebot: next [14:01:18] In 3 hour(s) and 58 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T1800) [14:02:09] !log upgrading install2002 to reprepro 5.1.1 [14:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:59] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: connect to address 10.64.32.174 and port 25: Connection refused [14:05:19] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:06:08] (03PS1) 10Jcrespo: mariadb: Allow full reimage of db2041,38,37,35 (still on jessie) [puppet] - 10https://gerrit.wikimedia.org/r/356387 [14:06:11] (03CR) 10Elukey: "Looks good to me, but have we tested the PLAINTEXT://:9092 value as default config in labs?" [puppet] - 10https://gerrit.wikimedia.org/r/355796 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [14:06:32] (03CR) 10Ottomata: "Yup :)" [puppet] - 10https://gerrit.wikimedia.org/r/355796 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [14:07:24] (03CR) 10Marostegui: [C: 031] mariadb: Allow full reimage of db2041,38,37,35 (still on jessie) [puppet] - 10https://gerrit.wikimedia.org/r/356387 (owner: 10Jcrespo) [14:08:58] there seems to be an increase of qps on enwiki [14:09:15] 06Operations, 10DBA, 10MediaWiki-extensions-SecurePoll, 07Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3304016 (10Reedy) Extension is enabled there, for one reason or another I've run the ALTER TABLE ther... 
[14:09:57] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3304018 (10BBlack) This is what we had before (copied from far above): | Bin | SizeRange | StoragePct | StorageSize (1 node) | Disk | --- | --- | --- | --- | B... [14:10:06] was there a deployment last night? [14:10:31] looks so as per SAL yes [14:11:08] what is the state of pt-table-checksum, is it running? [14:11:13] nope [14:11:19] RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational [14:11:59] RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.153 sec. response time [14:12:49] (03PS1) 10BBlack: cache_upload: rebalance storage bins [puppet] - 10https://gerrit.wikimedia.org/r/356394 (https://phabricator.wikimedia.org/T145661) [14:13:01] no change on the train, it is not that https://tools.wmflabs.org/versions/ [14:14:28] it starts at 3am or so but there is nothing around that time on sal [14:14:39] (03CR) 10Ottomata: [C: 031] "I've only done a skim of code, in general LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [14:20:17] I cannot see any grafana graph for any individual mysql server :| [14:20:21] I do not see other code changes https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=slave it must be regular requests [14:20:27] ? [14:20:55] WFM [14:20:57] https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089 [14:21:00] this works for you? 
[14:21:18] interesting, without the admin it works [14:21:18] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 (10MoritzMuehlenhoff) The salt key for ores2004 wasn't accepted (noticed that when rolling out the sudo update), this has been fixed. [14:21:18] :| [14:21:26] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1089 works [14:21:27] marostegui: works with both for me [14:21:38] for me the admin one doesn't [14:21:51] both for me [14:22:07] weird, I have tried two browsers...anyways [14:22:07] yes [14:22:20] that increase happens at 3am and I cannot see anything [14:22:28] I have been going thru phabricator at that time too [14:22:43] (03CR) 10Ema: [V: 032 C: 032] cache_upload: rebalance storage bins [puppet] - 10https://gerrit.wikimedia.org/r/356394 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [14:23:00] RECOVERY - HP RAID on ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [14:23:17] 06Operations, 10OTRS, 13Patch-For-Review: Upgrade OTRS to 5.0.19 - https://phabricator.wikimedia.org/T165284#3261913 (10akosiaris) The upgrade is done. Everything seems to be fine at a first look, will leave this open and resolve in a few days. 
[14:25:31] (03PS10) 10Paladox: Upgrade gerrit to 2.14.1 (DO NOT MERGE) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [14:25:45] Um, that was meant to be hidden [14:25:50] i did push drafts [14:31:13] 06Operations, 10ops-codfw, 10DBA: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3304126 (10jcrespo) [14:34:00] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2083856 [14:34:17] !log init7 fixed the issue, ping works from the init7 interface, reenabling the BGP session - T166663 [14:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:26] T166663: Init7 routing loop - https://phabricator.wikimedia.org/T166663 [14:38:01] 06Operations, 10netops: Init7 routing loop - https://phabricator.wikimedia.org/T166663#3304151 (10ayounsi) 05Open>03Resolved BGP re-enabled, confirmed working. [14:43:00] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [14:44:38] 06Operations, 10ops-codfw, 10DBA: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3304167 (10jcrespo) Holding this until the second reinstall fails or succeeds. This is the list of future reimages: https://gerrit.wikimedia.org/r/#/c/356387/ Should we... [14:46:49] 06Operations: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583#3304175 (10MoritzMuehlenhoff) [14:47:55] 06Operations, 10ops-codfw, 10DBA: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3304179 (10jcrespo) a:05jcrespo>03Papaul Confirmed if fails consistently to boot after install and 99.9% sure that a BIOS upgrade will fix it. Please, papaul help us... 
[14:48:09] (03CR) 10Giuseppe Lavagetto: [C: 032] Allow arbitrary puppet configuration items in PCC [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356386 (owner: 10Alexandros Kosiaris) [14:48:58] 06Operations, 10ops-codfw, 10DBA: db2044 cannot install jessie - requires BIOS firmware upgrade - https://phabricator.wikimedia.org/T166683#3304126 (10MoritzMuehlenhoff) I think we should preemptively upgrade the BIOS on all servers of that order/batch. We can't rule out that some of the symptoms fixed in th... [14:52:03] (03CR) 10Muehlenhoff: "You should build the package using standard Debian tools, i.e. using dh(1) and a debian/rules file. Hand-crafting this will only lead to s" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [14:52:52] (03CR) 10Mforns: "LGTM! as always, some pedantic comments" (0316 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [14:57:28] (03PS1) 10Ema: varnish: do not chmod VSM files [puppet] - 10https://gerrit.wikimedia.org/r/356401 [15:00:35] (03PS1) 10Alexandros Kosiaris: Update puppet_compiler to 0.1.9 [puppet] - 10https://gerrit.wikimedia.org/r/356402 [15:00:49] (03CR) 10jerkins-bot: [V: 04-1] Update puppet_compiler to 0.1.9 [puppet] - 10https://gerrit.wikimedia.org/r/356402 (owner: 10Alexandros Kosiaris) [15:02:05] (03PS2) 10Alexandros Kosiaris: Update puppet_compiler to 0.1.9 [puppet] - 10https://gerrit.wikimedia.org/r/356402 [15:02:11] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update puppet_compiler to 0.1.9 [puppet] - 10https://gerrit.wikimedia.org/r/356402 (owner: 10Alexandros Kosiaris) [15:07:14] (03PS3) 10Alexandros Kosiaris: Test the future parser in puppet compiler [puppet] - 10https://gerrit.wikimedia.org/r/322898 [15:31:46] !log merge cache_maps into cache_upload: move LVS IPs T164608 [15:31:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:56] T164608: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608 [15:33:09] (03PS2) 10Ema: maps->upload: move LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/353054 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [15:33:27] (03CR) 10Ema: [V: 032 C: 032] maps->upload: move LVS IPs [puppet] - 10https://gerrit.wikimedia.org/r/353054 (https://phabricator.wikimedia.org/T164608) (owner: 10BBlack) [15:36:57] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3304400 (10Cmjohnson) replaced the battery with a well used one from a decom'd db. Hopefully this will work for long enough. Server has been powered on [15:37:20] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [15:38:35] (03PS4) 10Alexandros Kosiaris: Test the future parser in puppet compiler [puppet] - 10https://gerrit.wikimedia.org/r/322898 [15:39:38] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3304407 (10Marostegui) Thanks Chris. The battery is now charging ``` Battery State: Optimal BBU Firmware Status: Charging Status : Charging... [15:40:29] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3304409 (10Marostegui) 05Open>03Resolved I will mark this as resolve again and let's see how long it lasts ``` root@db1048:~# megacli -ldinfo -l0 -a0 |... 
[15:44:52] (03PS1) 10Alexandros Kosiaris: Fix typo with wrong parenthesis [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356410 [15:46:02] (03PS2) 10Alexandros Kosiaris: Fix wrong parenthesis typo in map() [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356410 [15:46:45] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix wrong parenthesis typo in map() [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356410 (owner: 10Alexandros Kosiaris) [15:47:20] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [15:49:07] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3304451 (10Marostegui) I will revert the DNS patch tomorrow morning once the battery has recharged and all that. [15:54:03] 06Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304453 (10jcrespo) [15:56:32] 06Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304486 (10jcrespo) {P5513} [15:58:53] !log Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds [15:58:58] sjoerddebruin: FYI ^ [15:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:02] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [15:59:18] hoo: don't expect much changes, but thanks :) [16:00:25] yeah… the number of correlations didn't increase as much this time either [16:00:52] There haven't been any large bot tasks recently afaik [16:01:09] 06Operations, 10MediaWiki-General-or-Unknown, 
06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3304494 (10greg) So.... do we have anyone else watching this ticket who can do what @Krink... [16:02:40] (03PS2) 10Volans: Icinga: improve raid handler message [puppet] - 10https://gerrit.wikimedia.org/r/356365 (https://phabricator.wikimedia.org/T166519) [16:04:06] 06Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304501 (10jcrespo) [16:04:16] (03CR) 10Volans: [C: 032] Icinga: improve raid handler message [puppet] - 10https://gerrit.wikimedia.org/r/356365 (https://phabricator.wikimedia.org/T166519) (owner: 10Volans) [16:04:52] jynus: I'll use db1094 alarm to test it, removing the ack and seeing how it goes [16:05:04] ofc I'll close the duplicate task if all goes well [16:05:08] 06Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3304453 (10jcrespo) Please note that **the ticket us public, but the list of ips is not**, do not take private data from the ips list and copy it here. [16:05:16] (03PS1) 10Hoo man: Index article placeholders up to Q16956 on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356414 (https://phabricator.wikimedia.org/T162244) [16:06:13] alarm, no page, I asume? [16:06:43] raid alarms don't page IIRC [16:06:54] go ahead, then [16:07:13] * volans waiting puppet to run on tegmen... ;) [16:07:20] thanks [16:11:07] I just scheduled a minor deployment window for 16:30 UTC. Just for the record, the change itself is trivial. 
[16:11:39] (03Abandoned) 10Gehel: logstash - forcemerge to 10 segments instead of 1 [puppet] - 10https://gerrit.wikimedia.org/r/356384 (https://phabricator.wikimedia.org/T166154) (owner: 10Gehel) [16:13:32] 06Operations, 10DBA, 10MediaWiki-extensions-SecurePoll, 07Wikimedia-log-errors: Error (Wikimedia\Rdbms\DBQueryError) when creating a SecurePoll poll on testwiki - https://phabricator.wikimedia.org/T166568#3304533 (10jcrespo) a:03MaxBioHazard So according to Reedy, Marostegui and demos, this seems like a... [16:14:16] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave] [16:14:35] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave] [16:14:55] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:15:08] XioNoX: around? ^ [16:15:17] yo [16:15:35] for tegmen I'm trying to run puppet... seems stuck, I'm debugging [16:15:47] looking, IC-314533 down it seems [16:17:23] ema: there is a maintenance notification for this circuit, but for like early this morning [16:17:45] XioNoX: ok, maybe there's been a mistake adding it to the calendar or whatever. tl;dr of the impact? 
[16:18:05] Impact: 2 x 30 minutes interruption [16:19:04] (03PS1) 10Alexandros Kosiaris: extend() args list, not append() [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356416 [16:19:17] ema: for us, tests show that the traffic is properly routing around the issue [16:19:48] (03CR) 10Alexandros Kosiaris: [C: 032] extend() args list, not append() [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/356416 (owner: 10Alexandros Kosiaris) [16:19:57] XioNoX: great, will carry on with the maps -> upload work then. Thanks! [16:20:15] they did say though that they did perform their maintenance 5h ago [16:20:19] (and finish) [16:21:16] I'm sending them an email [16:23:30] email sent [16:23:44] thanks for the heads-up [16:23:58] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3304586 (10RobH) I just got a notice from Dasher in the last ten minutes. Tracking [[ https://wwwapps.ups.com/WebTracking/track?track=yes&trac... [16:30:05] hoo: Dear anthropoid, the time has come. Please deploy Article Placeholder search indexing (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T1630). 
[16:30:27] (03CR) 10Hoo man: [C: 032] Index article placeholders up to Q16956 on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356414 (https://phabricator.wikimedia.org/T162244) (owner: 10Hoo man) [16:31:54] (03Merged) 10jenkins-bot: Index article placeholders up to Q16956 on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356414 (https://phabricator.wikimedia.org/T162244) (owner: 10Hoo man) [16:31:55] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:32:06] (03CR) 10jenkins-bot: Index article placeholders up to Q16956 on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356414 (https://phabricator.wikimedia.org/T162244) (owner: 10Hoo man) [16:33:35] 06Operations, 13Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#2738609 (10fgiunchedi) The `@resolve` "race" is also present in stretch, though more severe since ferm reliably fails when `@resolve` is used. See also https://bugs.de... [16:33:39] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Index article placeholders up to Q16956 on cywiki (T162244) (duration: 00m 42s) [16:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:48] T162244: [Story] Search index 10,000 article placeholders on cywiki for testing and evaluation purposes - https://phabricator.wikimedia.org/T162244 [16:34:28] (03PS3) 10Filippo Giunchedi: prometheus: allow user 'prometheus' to export metrics too [puppet] - 10https://gerrit.wikimedia.org/r/356023 [16:34:45] RECOVERY - HP RAID on db1094 is OK: reset to test alarm [16:34:50] this is me ^^^ [16:35:27] why does it say ok? [16:35:36] or is it only temporary? 
[16:35:45] I had to manually reset it to OK to have icinga now alarm it again [16:35:50] ok [16:36:16] (03CR) 10BryanDavis: [C: 032] Always cleanup manifest on stop [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/350362 (https://phabricator.wikimedia.org/T163355) (owner: 10BryanDavis) [16:36:35] PROBLEM - HP RAID on db1094 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 [16:36:36] expecting it to alarm about now [16:36:37] (03CR) 10BryanDavis: [C: 032] Sort backend provided types when displaying help [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/350364 (owner: 10BryanDavis) [16:36:37] ACKNOWLEDGEMENT - HP RAID on db1094 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166700 [16:36:42] 06Operations, 10ops-eqiad: Degraded RAID on db1094 - https://phabricator.wikimedia.org/T166700#3304640 (10ops-monitoring-bot) [16:37:21] (03Merged) 10jenkins-bot: Always cleanup manifest on stop [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/350362 (https://phabricator.wikimedia.org/T163355) (owner: 10BryanDavis) [16:37:24] 06Operations, 10ops-eqiad: Degraded RAID on db1094 - https://phabricator.wikimedia.org/T166700#3304645 (10Volans) [16:37:26] 06Operations, 10ops-eqiad, 10DBA: Degraded BBU on db1094 (was: Degraded RAID on db1094) - https://phabricator.wikimedia.org/T166518#3304647 (10Volans) [16:37:27] seems to work fine [16:37:33] * godog eyes jenkins [16:38:04] (03Merged) 10jenkins-bot: Sort backend provided types when displaying help [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/350364 (owner: 10BryanDavis) [16:38:11] that worked [16:38:14] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: allow user 'prometheus' to export metrics too [puppet] - 
10https://gerrit.wikimedia.org/r/356023 (owner: 10Filippo Giunchedi) [16:38:20] lol [16:39:13] I have here my dowsing rod in case that didn't work though [16:40:20] 06Operations, 13Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race - https://phabricator.wikimedia.org/T148986#3304665 (10MoritzMuehlenhoff) The stretch issue is tracked separately via T166653 [16:43:29] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3304694 (10Papaul) [16:43:31] 06Operations, 10ops-codfw: correct label on labtestvirt2002/WMF3810 - https://phabricator.wikimedia.org/T166598#3304692 (10Papaul) 05Open>03Resolved Complete [16:46:22] volans: but I do not understand the final goal [16:46:36] shouldn't those scripts merge? [16:46:51] they seem to be redundant with one another [16:46:59] are we talking raid checks or reimage? :) [16:47:08] the raid check [16:47:14] (03PS2) 10Muehlenhoff: Add Kafka main brokers to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356167 [16:47:20] your sentence can be applied to both :D [16:47:42] but wmf-reimage I know they should, and that you have plans for it [16:48:01] I do not know about those, and you asked me to maintain the other one [16:48:08] so I am confused [16:48:27] So, agree in principle that they should converge, and we had an idea to actually make a bigger project to unify, simplify and improve the whole disk/raid monitoring/alarming [16:48:48] but due to other priorities it will not be worked on soon (as in Q1/2 for sure) [16:48:53] I understand if you say now that it is complex [16:48:59] I just was confused [16:49:11] because it seemed to me you wanted to keep both separate [16:49:55] because if not, I can help too with the merge [16:50:15] they do different things, if the check_raid is able to report a message to icinga that has enough information for the people to work on it without running 
manual commands, raid_handler can just be the phab reporting and icinga acking part [16:50:46] but it makes no sense to duplicate things - they can share a common body [16:50:56] and then have separate actions [16:51:01] (03CR) 10Muehlenhoff: [C: 032] Add Kafka main brokers to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356167 (owner: 10Muehlenhoff) [16:51:11] for example, now wb detection is only on one [16:51:17] but it is not reported on the other [16:51:59] I am not saying, please do it, I am asking what is your intention for the future [16:51:59] it is now, adding the error message, given that the error reports write policy is not blabla [16:52:25] so now it says wt detected and so? [16:52:32] but using icinga, right? [16:52:41] my intention is I wish I'll have more time to work on it, but given the other stuff don't expect any major effort from me on this in the next quarter [16:52:55] and I am cool [16:52:56] with that [16:53:10] !log merge cache_maps into cache_upload: finished moving LVS IPs T164608 [16:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:20] T164608: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608 [16:53:20] I just want to know if, given enough time, you envision those merged? 
[16:53:20] see https://phabricator.wikimedia.org/T166700, I include in the message the Icinga error [16:53:34] so for a megaraid one [16:53:36] (03PS2) 10Muehlenhoff: Add Kafka analytics brokers to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356183 [16:53:38] it will have your new message [16:53:45] and knowing that, I can do things differently so I do not make that path worse [16:53:59] !log restart varnish-backend on cp1074 [16:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:15] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [16:54:45] PROBLEM - Nginx local proxy to apache on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:10] jynus: yes to have one tool, not sure about "merged" is probably rewritten, at least most of the check_raid part. It should also do the metric reporting [16:55:13] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3304774 (10Papaul) [16:55:24] given that querying the controllers and parsing the data is expensive [16:55:34] so you created the other one from 0? [16:55:35] RECOVERY - Nginx local proxy to apache on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.308 second response time [16:55:47] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3304776 (10Krinkle) @greg I'm working late this week (evening instead of morning), I could... [16:55:48] better to do it once and output icinga alert, monitoring metrics, etc.. [16:55:54] which one, raid_handler? 
[16:55:57] yes [16:56:26] yes, but in theory it was supposed to just handle the "I got an alarm, ack icinga, create a task in phab for the proper dc-ops" part [16:56:30] ok [16:56:37] now I understand better [16:56:53] the other check, the get_raid_megacli/hpssacli came after [16:56:53] so they could be 2 scripts on the same framework [16:57:06] sharing some code [16:57:13] but with different purposes [16:58:01] so the raid_handler can probably be left as is, the part that is "duplicated" in the sense of parsing output is only for megaraid and is [16:58:06] modules/raid/files/get-raid-status-megacli.py [16:58:34] I understand now better [16:58:43] that was only my question [16:58:45] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2015336 [16:58:48] the hpssacli is just a quick wrapper: modules/raid/files/get-raid-status-megacli.py [16:58:49] 06Operations, 10ops-codfw, 06cloud-services-team, 13Patch-For-Review: rack/setup/install labtestvirt2003 - https://phabricator.wikimedia.org/T166237#3304793 (10Papaul) @chasemp the others lab servers in the DHCP file are pointing to the Trusty install do you want to install Trusty on labtestvirt2003 or Je... 
[16:59:01] to get all controllers and overcome NRPE limit in size [16:59:12] copy/pasta fail [16:59:19] modules/raid/files/get-raid-status-hpssacli.sh [16:59:42] that needs redoing, too [16:59:44] I think [16:59:47] not the handler [16:59:48] the check [16:59:55] because its too expensive [16:59:59] (03PS1) 10Gehel: logstash - fix minor typo in documentation [puppet] - 10https://gerrit.wikimedia.org/r/356420 [17:00:32] we could also decide to convert those to crontab+passive check on icinga for example [17:00:43] the whole thing needs a bit of thinking, time and effort [17:00:50] yes, indeed [17:01:03] I thought you already had a perfect goal [17:01:10] and that is why I was asking you [17:01:25] and I thought those new scripts where perfect replacements [17:01:50] I know now it is very WIP, and the handler scripts where single-purpose [17:01:50] (03CR) 10Muehlenhoff: [C: 032] Add Kafka analytics brokers to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356183 (owner: 10Muehlenhoff) [17:02:22] oh no, that was a quick thing at the start to make life easier for the broken disks while investigating for what became cumin :) [17:02:40] (03CR) 10Gehel: [C: 032] logstash - fix minor typo in documentation [puppet] - 10https://gerrit.wikimedia.org/r/356420 (owner: 10Gehel) [17:02:40] it will be a goal for sure, just not sure when, but you can push/vote for it ;) [17:02:45] (03PS2) 10Gehel: logstash - fix minor typo in documentation [puppet] - 10https://gerrit.wikimedia.org/r/356420 [17:02:46] I think my confusion was [17:03:02] because I think I saw some icinga-like output on those new scripts [17:03:23] but maybe I was mixing scripts [17:04:05] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [17:04:30] right, that was because once I parsed everything it was so simple to add a --nagios option that I did it and in my mind get-raid-status-megacli.py could, with not much effort, replace the megaraid part of check_raid.py 
IMHO [17:04:49] *at that time [17:05:09] (03CR) 10Elukey: "Thanks Marcel!" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [17:05:55] so the final goal would be to have a mini framework that has specialized scripts for each different controller types and that is able to output text, nagios, prometheus(or grafana) with a single pass [17:08:04] (03PS2) 10Muehlenhoff: Add Hadoop masters to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356188 [17:09:05] maybe with some sort of local cache, so that it will be easy to access any of those 3 outputs for example [17:10:26] (03PS1) 10Filippo Giunchedi: base: blacklist acpi_power_meter [puppet] - 10https://gerrit.wikimedia.org/r/356422 (https://phabricator.wikimedia.org/T125205) [17:11:43] paravoid: ^ I was reading the scrollback re: kernel messages, looks like the messages continued, also due to 6b68236591 I think [17:11:49] ema: ^ [17:12:09] (03PS1) 10Gergő Tisza: Revert "Use is_not_bot filter function for eventlogging mysql consumer" [puppet] - 10https://gerrit.wikimedia.org/r/356423 (https://phabricator.wikimedia.org/T67508) [17:12:28] I tried it on ms-be1023 btw [17:12:29] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3304910 (10greg) Thanks @Krinkle. Opsen following along: can someone be around at that tim... 
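A minimal sketch of the "mini framework" idea from the discussion above (query and parse each controller once, then emit text, nagios or prometheus output from the same parsed result). This is not the code in check_raid.py, raid_handler or get-raid-status-megacli.py; the class names, function names and metric names (raid_controller_ok, raid_battery_ok, raid_drive_ok) are all hypothetical, and the controller-specific parsing step is assumed to have already produced the data structure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Drive:
    slot: str       # e.g. "1I:1:1" (hypothetical field names)
    healthy: bool


@dataclass
class ControllerStatus:
    """Parsed state of one controller; built once, rendered many ways."""
    name: str
    controller_ok: bool
    battery_ok: bool
    drives: List[Drive] = field(default_factory=list)

    def failed_drives(self) -> List[str]:
        return [d.slot for d in self.drives if not d.healthy]

    def is_ok(self) -> bool:
        return self.controller_ok and self.battery_ok and not self.failed_drives()


def render_text(status: ControllerStatus) -> str:
    """Human-readable report, one line per component."""
    lines = [
        f"Controller {status.name}: {'OK' if status.controller_ok else 'FAILED'}",
        f"Battery: {'OK' if status.battery_ok else 'FAILED'}",
    ]
    lines += [f"Drive {d.slot}: {'OK' if d.healthy else 'FAILED'}" for d in status.drives]
    return "\n".join(lines)


def render_nagios(status: ControllerStatus) -> Tuple[int, str]:
    """Return (exit code, message) using the usual plugin codes: 0 OK, 2 CRITICAL."""
    if status.is_ok():
        return 0, f"OK: {status.name}: {len(status.drives)} drives OK"
    problems = []
    if not status.controller_ok:
        problems.append("controller failed")
    if not status.battery_ok:
        problems.append("battery failed")
    failed = status.failed_drives()
    if failed:
        problems.append("failed drives: " + ", ".join(failed))
    return 2, f"CRITICAL: {status.name}: " + "; ".join(problems)


def render_prometheus(status: ControllerStatus) -> str:
    """Prometheus text exposition lines; metric names are made up for this sketch."""
    lines = [
        f'raid_controller_ok{{controller="{status.name}"}} {int(status.controller_ok)}',
        f'raid_battery_ok{{controller="{status.name}"}} {int(status.battery_ok)}',
    ]
    for d in status.drives:
        lines.append(
            f'raid_drive_ok{{controller="{status.name}",drive="{d.slot}"}} {int(d.healthy)}'
        )
    return "\n".join(lines) + "\n"
```

Because every renderer takes the same ControllerStatus, the expensive controller query only has to happen once per run; the "local cache" mentioned above could then be as simple as serializing that object to disk and letting each output mode read it back.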
[17:15:44] (03CR) 10Muehlenhoff: [C: 032] Add Hadoop masters to network constants [puppet] - 10https://gerrit.wikimedia.org/r/356188 (owner: 10Muehlenhoff) [17:19:09] 06Operations, 13Patch-For-Review: Firewall sets not being loaded post-reboot due to a @resolve race on jessie - https://phabricator.wikimedia.org/T148986#3304945 (10fgiunchedi) [17:21:48] (03PS4) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [17:22:19] 06Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3304971 (10Pchelolo) @RobH The 3 day period have passed, could you please merge this as we need to gather some data in preparation to a meeting... [17:22:33] godog: oh, so it wasn't (only?) the ipmi check causing those messages but rather hwmon reading under /sys? [17:22:37] 06Operations, 06Release-Engineering-Team (Kanban), 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3304973 (10Dzahn) I assume this includes shell access on the machine. This will need an admin group on it for that. Would the existing "releasers-mediawiki" ma... [17:22:49] (03PS2) 10Elukey: adding gwicke and ppchelko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355804 (owner: 10RobH) [17:22:51] (03PS3) 10RobH: adding gwicke and ppchelko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355804 [17:22:55] RECOVERY - HP RAID on ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [17:22:59] elukey: i got it? [17:23:06] or you can [17:23:15] it was assigned to me so i assumed it was me doing it ;] [17:23:23] ema: afaics yes [17:23:41] robh: sorry! 
I thought you were busy and thought to handle it, please go ahead :) [17:23:53] hehe, no worries, cool I'll merge [17:24:18] (03CR) 10RobH: [C: 032] adding gwicke and ppchelko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355804 (owner: 10RobH) [17:26:35] 06Operations, 10Ops-Access-Requests, 10Analytics, 13Patch-For-Review, 06Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3305015 (10RobH) 05Open>03Resolved Merged and puppet is running on affected servers right now. [17:28:48] 06Operations, 06Release-Engineering-Team (Kanban), 07Security-General: setup mwreleases1001.eqiad.wmnet - https://phabricator.wikimedia.org/T164030#3305031 (10demon) >>! In T164030#3304973, @Dzahn wrote: > I assume this includes shell access on the machine. This will need an admin group on it for that. > >... [17:30:47] !log swift eqiad-prod decom ms-be100[128] - T166489 [17:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:57] T166489: Decommission ms-be1001 - ms-be1012 - https://phabricator.wikimedia.org/T166489 [17:31:50] that might interest people https://githubengineering.com/dns-infrastructure-at-github/ [17:33:21] (03CR) 10Ema: [C: 031] base: blacklist acpi_power_meter [puppet] - 10https://gerrit.wikimedia.org/r/356422 (https://phabricator.wikimedia.org/T125205) (owner: 10Filippo Giunchedi) [17:33:44] also this: http://kubernetesbyexample.com/ [17:50:26] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3305113 (10Dzahn) >>! In T166587#3302076, @herron wrote: > My GPG key ID is C574276C (keyserver hkp://pool.sks-keyservers.net) Hi, so i imported this key successfully but then when i try to use it to... 
[17:55:15] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T1800). [18:01:22] (03PS3) 10ArielGlenn: split up flow dumps into stubs and content passes [dumps] - 10https://gerrit.wikimedia.org/r/355077 (https://phabricator.wikimedia.org/T164262) [18:01:32] no swat yay [18:01:44] (03CR) 10jerkins-bot: [V: 04-1] split up flow dumps into stubs and content passes [dumps] - 10https://gerrit.wikimedia.org/r/355077 (https://phabricator.wikimedia.org/T164262) (owner: 10ArielGlenn) [18:02:35] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [18:08:27] (03PS1) 10Dzahn: add admin group releasers-mediawiki to mwreleases1001 [puppet] - 10https://gerrit.wikimedia.org/r/356425 (https://phabricator.wikimedia.org/T164030) [18:09:37] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3305183 (10herron) Looks like I inadvertently generated a signing only key. I've revoked the C574276C key. Let's try this again with new Key ID 0DEC052E. [18:10:07] (03PS1) 10Filippo Giunchedi: raid: bump hpssacli timeout [puppet] - 10https://gerrit.wikimedia.org/r/356426 [18:10:47] (03PS1) 10Filippo Giunchedi: site: add prometheus global instance in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/356427 [18:11:03] (03CR) 10Dzahn: [C: 04-1] "there is currently a syntax error in it. 
see jenkins-bot -1" [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [18:11:15] 06Operations, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3305195 (10Cmjohnson) @Marostegui is this still an issue? [18:13:36] (03CR) 10Volans: [C: 04-1] "We have command_timeout=60 in nrpe.cfg, I don't think this will work" [puppet] - 10https://gerrit.wikimedia.org/r/356426 (owner: 10Filippo Giunchedi) [18:17:15] PROBLEM - HP RAID on ms-be1036 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [18:19:05] (03PS8) 10Paladox: jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [18:19:57] (03PS2) 10Filippo Giunchedi: raid: bump hpssacli timeout [puppet] - 10https://gerrit.wikimedia.org/r/356426 [18:19:59] volans: ^ [18:20:11] (03CR) 10jerkins-bot: [V: 04-1] jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [18:20:39] godog: do you think those extra 10s will reduce the flappines? 
:) [18:21:14] I do yeah [18:21:38] better than having the check downtimed, lacking another solution [18:22:22] the other "quick" solution is to move those to crontab + passive checks I guess [18:22:49] the other workaround that could be done is to increase the command_timeout globally for NRPE just on the problematic hosts [18:22:57] a bit meh to me tbh [18:23:35] this whole situation is a bit meh heh :D [18:23:36] but sure, let's try this first ;) [18:23:51] so yeah lipstick on a pig [18:25:39] hpssacli script does some things that may be unnecessary - we could trim that [18:26:07] this might be a deployment blocker - https://phabricator.wikimedia.org/T166713 [18:26:08] (03CR) 10Volans: [C: 032] raid: bump hpssacli timeout [puppet] - 10https://gerrit.wikimedia.org/r/356426 (owner: 10Filippo Giunchedi) [18:26:26] godog: oops... mouse fail, I wanted to hit reply for a +1 [18:27:15] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [18:27:23] speaking of which [18:27:26] volans: no worries, I'll merge [18:27:31] thanks! [18:27:45] (03PS3) 10Filippo Giunchedi: raid: bump hpssacli timeout [puppet] - 10https://gerrit.wikimedia.org/r/356426 [18:27:46] I'm converting that to a +1 with comment, feel free to +2 yourself [18:28:05] Reedy, could this be due to the recent jsonconfig changes? 
^^^^ [18:28:19] (03CR) 10Volans: [C: 031] "LGTM as a quick workaround, if the flapping doesn't go down we need to consider next steps as in one of:" [puppet] - 10https://gerrit.wikimedia.org/r/356426 (owner: 10Filippo Giunchedi) [18:28:56] jynus: in this particular configuration yeah some checks might be unnecessary, more complexity in check_hpssacli though [18:29:13] (03CR) 10Filippo Giunchedi: [V: 032] raid: bump hpssacli timeout [puppet] - 10https://gerrit.wikimedia.org/r/356426 (owner: 10Filippo Giunchedi) [18:29:35] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:30:03] nice, although I converted it to +1 it actually kept the +2 status... [18:30:38] thanks gerrit to put the reply and merge+all_automation_you_might_have one next to the other :) [18:33:17] (03PS9) 10Paladox: jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [18:35:38] (03CR) 10jerkins-bot: [V: 04-1] jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) (owner: 10Paladox) [18:36:14] (03PS10) 10Paladox: jenkins: Install java 8 on stretch and jessie [puppet] - 10https://gerrit.wikimedia.org/r/356243 (https://phabricator.wikimedia.org/T166611) [18:36:21] sorry for spam ^^ [18:37:10] paladox: you should be able to run the tests on your local machine :D [18:37:15] RECOVERY - HP RAID on ms-be1036 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:37:47] oh, yep. [18:37:48] paladox: sudo apt-get install bundler && rake test [18:37:52] err [18:37:54] bundle exec rake test [18:37:55] that wont work on a mac [18:38:03] it does on mine ! 
[18:38:09] i did gem install puppet-lint [18:38:14] ah yeah no apt-get you are right sorry [18:38:29] gem install bundler [18:38:45] oh i will do that [18:38:51] then bundler takes care of installing the gems that are needed [18:38:58] ok [18:39:10] it installs them somewhere, and when you use "bundle exec " it will be run with that set of gems [18:39:15] eg bundle exec rake test [18:39:22] ok [18:39:23] thanks [18:47:24] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:48:46] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [18:50:26] RECOVERY - HP RAID on ms-be1030 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:50:36] RECOVERY - HP RAID on ms-be1037 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:51:56] RECOVERY - HP RAID on ms-be1029 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:51:56] RECOVERY - HP RAID on ms-be1039 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:51:56] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:52:03] 06Operations, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3305474 (10Marostegui) @Cmjohnson yes, check 
the original task description so you can see that there are a few servers that cannot be installed :-( [18:52:46] RECOVERY - HP RAID on ms-be1028 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:55:16] RECOVERY - HP RAID on ms-be1031 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:56:46] RECOVERY - HP RAID on ms-be1032 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [18:56:56] RECOVERY - HP RAID on ms-be1038 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T1900). [19:00:45] no later [19:01:07] (03CR) 10Hashar: "Some inline comments to help with the review." 
(036 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356373 (owner: 10Hashar) [19:02:09] (03PS1) 10Framawiki: Enable wgExtraSignatureNamespaces at NS:102 for trwiki* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356437 (https://phabricator.wikimedia.org/T166522) [19:03:30] (03PS2) 10Framawiki: Enable wgExtraSignatureNamespaces at NS:102 for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356437 (https://phabricator.wikimedia.org/T166522) [19:08:12] (03PS3) 10Framawiki: Enable wgExtraSignatureNamespaces at NS:102 for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356437 (https://phabricator.wikimedia.org/T166522) [19:09:55] (03CR) 10Chad: [C: 032] Scap clean: Rewrite to just do stuff on masters then sync [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355635 (owner: 10Chad) [19:11:54] (03Merged) 10jenkins-bot: Scap clean: Rewrite to just do stuff on masters then sync [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355635 (owner: 10Chad) [19:11:58] !log gehel@tin Started deploy [wdqs/wdqs@af495a2]: (no justification provided) [19:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:07] (03CR) 10jenkins-bot: Scap clean: Rewrite to just do stuff on masters then sync [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355635 (owner: 10Chad) [19:12:32] (03CR) 10Chad: [C: 032] Also clean up ExtensionMessages files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 (owner: 10Chad) [19:12:40] (03CR) 10jerkins-bot: [V: 04-1] Also clean up ExtensionMessages files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 (owner: 10Chad) [19:13:27] !log gehel@tin Finished deploy [wdqs/wdqs@af495a2]: (no justification provided) (duration: 01m 29s) [19:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:50] SMalyshev: wdqs upgrade with fixes to updater done [19:14:18] (03CR) 10Jcrespo: [C: 04-2] "Wait maybe until T166683 is fixed" 
[puppet] - 10https://gerrit.wikimedia.org/r/356387 (owner: 10Jcrespo) [19:14:46] (03PS3) 10Chad: Also clean up ExtensionMessages files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 [19:15:06] (03CR) 10Chad: [C: 032] Also clean up ExtensionMessages files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 (owner: 10Chad) [19:15:41] 06Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3305582 (10jcrespo) This started at exactly 3am and ended at exactly 18pm. All very weird. [19:16:28] (03Merged) 10jenkins-bot: Also clean up ExtensionMessages files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 (owner: 10Chad) [19:17:01] (03CR) 10jenkins-bot: Also clean up ExtensionMessages files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 (owner: 10Chad) [19:17:03] (03PS2) 10Ottomata: Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [19:19:26] (03CR) 10jerkins-bot: [V: 04-1] Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [19:20:27] (03PS2) 10Chad: Scap clean: Also drop old patch files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355653 [19:24:15] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3305630 (10Krinkle) (Continued investigation using the data we did capture in SAL and Logs... 
[19:26:32] (03PS3) 10Ottomata: Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) [19:28:03] (03CR) 10jerkins-bot: [V: 04-1] Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [19:28:33] (03CR) 10Chad: [C: 032] Scap clean: Also drop old patch files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355653 (owner: 10Chad) [19:29:13] (03PS7) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [19:29:49] (03Merged) 10jenkins-bot: Scap clean: Also drop old patch files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355653 (owner: 10Chad) [19:30:00] (03CR) 10jenkins-bot: Scap clean: Also drop old patch files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355653 (owner: 10Chad) [19:30:59] !log demon@tin Synchronized scap/plugins/clean.py: cleanup r us (duration: 00m 42s) [19:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:24] thcipriani: I may have just totally broken scap clean there ^^^. It's basically impossible to test outside of prod. I'm on choo choo duties next week, so don't bother running it this week anymore. I'll debug my own issues instead of making you. 
[19:43:49] (03PS6) 10Chad: Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) [19:45:15] (03CR) 10Chad: [C: 031] add admin group releasers-mediawiki to mwreleases1001 [puppet] - 10https://gerrit.wikimedia.org/r/356425 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [19:45:55] (03PS1) 10Phedenskog: Add Save Timing alerts to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/356449 (https://phabricator.wikimedia.org/T153170) [19:46:05] (03CR) 10Chad: "We should land this, pretty trivial" [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: 10Chad) [19:46:54] (03CR) 10jerkins-bot: [V: 04-1] Add Save Timing alerts to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/356449 (https://phabricator.wikimedia.org/T153170) (owner: 10Phedenskog) [19:48:53] (03CR) 10Tjones: [C: 04-1] Enable BM25 for Chinese wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829) (owner: 10Tjones) [19:50:17] (03PS2) 10Krinkle: Add Save Timing alerts to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/356449 (https://phabricator.wikimedia.org/T153170) (owner: 10Phedenskog) [19:50:29] (03CR) 10Krinkle: [C: 031] Add Save Timing alerts to Icinga [puppet] - 10https://gerrit.wikimedia.org/r/356449 (https://phabricator.wikimedia.org/T153170) (owner: 10Phedenskog) [19:59:55] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 122.36, 101.10, 71.30 [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T2000). 
[20:00:14] no parsoid deploy today [20:04:55] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 106.61, 102.89, 80.18 [20:06:54] no mobileapps deploy today [20:08:55] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 41.40, 74.82, 74.30 [20:13:22] !log mobrovac@tin Started deploy [citoid/deploy@7d69554]: Relaxing date validation - T132308 [20:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:33] T132308: Deal with incomplete and non standard dates, i.e. year only, year and month only, or season / quarter - https://phabricator.wikimedia.org/T132308 [20:15:54] !log mobrovac@tin Finished deploy [citoid/deploy@7d69554]: Relaxing date validation - T132308 (duration: 02m 32s) [20:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:12] !log gerrit: bringing offline for a few minutes for point release (2.13.4 -> 2.13.8, T158946) [20:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:21] T158946: Update gerrit to 2.13.8 - https://phabricator.wikimedia.org/T158946 [20:18:57] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#3305835 (10RobH) [20:18:59] 06Operations, 10ops-codfw: rack/setup/deploy mw22[1-5][0-9] switch configuration - https://phabricator.wikimedia.org/T136670#3305833 (10RobH) 05Open>03Resolved All of these are already done: ge-4/0/0 up up mw2239 ge-4/0/1 up up mw2240 ge-4/0/2 up up mw2241 ge-4/0/3... [20:22:45] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:23:43] Shit. [20:23:46] Shit shit. [20:23:48] RainbowSprinkles: heh, the gerrit downtime message is scary. "follow us while we debug"? it's a planned downtime! 
[20:23:48] Ok, rolling back [20:24:05] oh whoops, now it seems to be down for real. heh heh. [20:25:11] * mdholloway pokes his head in wondering if something's up with gerrit [20:25:15] It's me [20:25:16] Fixing [20:25:23] cool, thx [20:25:26] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:25:42] * dbrant awards token [20:25:45] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused [20:26:25] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:27:45] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [20:28:05] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [20:28:35] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:28:45] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:30:05] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:30:05] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:31:03] Ok gerrit back [20:31:08] Cascading failures will resolve [20:31:11] Well *that* sucked [20:31:23] Got some logs to analyze now :\ [20:31:35] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [20:31:36] 500 error [20:31:45] "500 Internal server error" [20:31:45] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [20:32:07] Authentication unavailable? [20:32:09] wtf.... [20:32:18] same for me [20:32:18] *too many open files* [20:32:21] that's what bailed earlier [20:32:24] Ok, what's new? [20:32:33] New? [20:32:35] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:32:41] I thought you downgraded back to 2.13.4 [20:32:54] was there a schema upgrade? [20:33:25] "Fix file_name column length.", that's from the release notes for .8. [20:34:38] It's trying to reindex everything for me? [20:34:40] Thanks gerrit. [20:34:40] It seems to me that gerrit is down again (or behind varnish) [20:34:42] I didn't want that [20:34:45] Amir1: I know [20:34:58] Okay [20:34:59] I mean I don't see why else it would open every single git pack file [20:35:02] core.packedGitOpenFiles [20:35:05] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:35:14] That doesn't help [20:36:38] maybe it's the reindex causing the errors? [20:36:52] i am only getting that from a quick google search [20:37:15] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [20:38:50] RainbowSprinkles what's ulimit -a [20:38:55] https://groups.google.com/forum/#!topic/repo-discuss/mGwI3yHnGWQ [20:39:45] gerrit flaky for anyone? [20:39:55] We already have it set to 65536, that shouldn't be it [20:40:01] dr0ptp4kt see scrollback. [20:40:05] I'm curious why gerrit is trying to open every freaking file it knows about [20:40:06] wah [20:40:10] * RainbowSprinkles kicks gerrit [20:40:31] Maybe a bug. If so, that seems to happen between 2.13.4 and 2.13.8 right? [20:41:29] i see this, [20:41:31] Enable systemd socket activation. [20:41:31] By setting httpd.inheritChannel to true, the server can be socket activated by systemd or xinetd. [20:41:50] but that couldn't be it, it should be disabled. or at least i hope it is. [20:42:09] Completely unrelated. [20:43:05] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:43:05] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:43:05] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:43:35] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:43:39] that is a good way to find out what is using puppet to deploy from Gerrit :] [20:43:45] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [20:43:45] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] [20:43:46] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:44:35] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational [20:44:45] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 2 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/discovery-stats],Exec[git_pull_refinery_source],Exec[git_pull_aggregator_code],Exec[git_pull_analytics/reportupdater] [20:44:45] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.4-13-gc0c5cc4742 (SSHD-CORE-1.2.0) (protocol 2.0) [20:44:45] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [20:44:51] Hmmmm [20:45:05] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [20:45:42] RainbowSprinkles it seems it was the indexer [20:45:44] works now [20:45:48] No [20:45:51] It busted again [20:46:00] works for me [20:46:06] Doubled # of open files for gerrit, failed when it hit that again [20:46:09] Just took longer [20:46:23] Wtf happened here. [20:46:38] lol [20:46:43] the text is getting bigger [20:46:45] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:46:49] the logo has disappeared [20:46:53] Shut up. 
[20:46:57] You're not helping [20:47:05] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/endowment],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikimedia/TransparencyReport] [20:47:55] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:49:45] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/limn-edit-data],Exec[git_pull_analytics/limn-language-data],Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics/limn-ee-data] [20:50:35] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:52:29] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3305954 (10Dzahn) @herron Ok, cool. Imported the new key, was able to use it to encrypt. I encrypted a file kherron-racktables.gpg and put it in your home dir on rutherfordium. That contains a login fo... [20:54:35] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:54:45] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [20:57:05] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:05:29] (03PS1) 10Hashar: nodepool: lower min-ready for trusty [puppet] - 10https://gerrit.wikimedia.org/r/356466 [21:05:45] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [21:05:45] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.4-13-gc0c5cc4742 (SSHD-CORE-1.2.0) (protocol 2.0) [21:07:25] (03CR) 10Hashar: "Based on the time to wait for jessie/trusty instances to be available at https://grafana.wikimedia.org/dashboard/db/zuul?panelId=18&fullsc" [puppet] - 10https://gerrit.wikimedia.org/r/356466 (owner: 10Hashar) [21:07:45] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:08:05] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:08:59] greg-g: I'll be back in 10-15min, had a bunch of meetings back to back. Ping me when things are ready? [21:09:05] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [21:09:27] Krinkle: #redirect thcipriani [21:09:39] :) [21:09:45] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:09:47] I have to run shortly to play family taxi [21:09:52] I trust you two [21:10:00] oh good :) [21:10:05] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:10:14] Krinkle: I can get everything ready [21:10:17] thcipriani: for now...... 
[21:10:21] :P [21:11:34] (03Abandoned) 10Gergő Tisza: Revert "Use is_not_bot filter function for eventlogging mysql consumer" [puppet] - 10https://gerrit.wikimedia.org/r/356423 (https://phabricator.wikimedia.org/T67508) (owner: 10Gergő Tisza) [21:11:45] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [21:12:05] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:13:45] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:14:05] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:14:08] (03PS1) 10Bartosz Dziewoński: Fix typos on Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356477 [21:14:10] (03PS1) 10Bartosz Dziewoński: Fix indentation of Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356478 [21:15:12] (03CR) 10Paladox: [C: 031] Fix indentation of Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356478 (owner: 10Bartosz Dziewoński) [21:15:42] (03CR) 10Paladox: [C: 031] Fix typos on Gerrit downtime page [puppet] - 10https://gerrit.wikimedia.org/r/356477 (owner: 10Bartosz Dziewoński) [21:16:37] (03CR) 10Chad: "I liked it how it was :'(" [puppet] - 10https://gerrit.wikimedia.org/r/356478 (owner: 10Bartosz Dziewoński) [21:17:31] RainbowSprinkles: it had mixed tabs and spaces you heathen D: [21:17:46] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [21:18:10] MatmaRex: Good thing you don't have to maintain that code then ;-) [21:18:20] RainbowSprinkles: (but i don't mind abandoning that if you think it's unnecessary) [21:18:22] heh [21:18:35] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures 
[21:18:57] MatmaRex: No it's fine, I'm just being contrarian today :p [21:21:45] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [21:22:35] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:24:25] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:25:45] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [21:26:06] Krinkle: whenever you're around, I'm ready to try to sync out the new wikiversion [21:26:15] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [21:26:55] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational [21:27:42] thcipriani: okay, be there in 2min [21:27:50] ok [21:28:15] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [21:29:59] !log Restored mwdebug1001 to wmf1 with normal nutcracker/memcached and puppet running [21:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:17] thcipriani: Ready [21:30:35] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:30:38] Krinkle: ok, going live with wmf.2 everywhere [21:32:10] thcipriani: I've captured two wmf.1 profiles of uncached /wiki/Main_Page on enwiki just now via XMD/mwdebug1002 [21:32:15] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: wikipedias to 1.30.0-wmf.2 [21:32:19] ^ Krinkle live now [21:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:15] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 
failures [21:34:20] so last time, for whatever reason, touching and syncing InitialiseSettings.php really kicked off the flood of issues that is https://phabricator.wikimedia.org/T166384 so if we don't see initial performance issues, I'd like to try a no-op sync of IS.php [21:34:25] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [21:36:31] thcipriani: Interesting. If that is the case, it would suggest that something invalidated by IS mtime triggered the slow down. [21:36:51] Afaik the only thing using that in prod is the mtime on wmf-config/ dir itself in CommonSettings as used for the caching of global expansion by SiteConf [21:37:16] so far I don't see any slowdown, fyi [21:37:17] Which doesn't make any sense. [21:38:00] gilles: How's the app server log? [21:38:11] nothing unusual there [21:38:21] (03Draft1) 10Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 [21:38:23] (03PS2) 10Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 [21:38:35] mutante ^^ long commit msg, explaining things :) [21:38:42] RainbowSprinkles ^^ [21:38:46] FWIW, last time I rolled versions forward, it was quiet then I sync'd out an innocuous change to IS.php and that is when the flood of alerts started. [21:38:57] right [21:39:10] paladox: good @ long message :) [21:39:20] :) [21:39:39] we can do that, just pick a different duration between the deployment and the noop sync. not 12ish minutes, otherwise we won't be able to tell if it was some cache expiry being hit [21:40:15] It's been 8 minutes. [21:40:20] (03PS3) 10Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 [21:40:21] Let's do it now? [21:40:24] sure [21:40:26] (03CR) 10Chad: "Comment on 4096 explicitly, but generally I'm not sure if unlimited is the right option everywhere." 
(031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (owner: 10Paladox) [21:40:30] sure, no op for IS coming up [21:41:04] (03CR) 10Paladox: Gerrit: Set ulimit's in gerrit.service (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (owner: 10Paladox) [21:41:16] (03PS4) 10Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 [21:41:22] done, set it at 65536 :) [21:41:38] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3306069 (10Dzahn) [21:41:43] will need to do it for puppet. and yes i agree we need to move to puppet [21:41:48] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: touch InitialiseSettings.php (duration: 00m 39s) [21:41:53] {{done}} [21:41:54] will the codfw icinga alert from last time whine in this channel if it reoccurs? [21:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:03] gilles: yes [21:42:19] Do you want me to do that now RainbowSprinkles? (gerrit.service in puppet) [21:42:26] still quiet [21:42:48] No, not right now [21:42:49] thcipriani: And the alert is also enabled for eqiad right? (post-dc switchover) [21:42:52] ok [21:42:52] I want to leave it alone right now [21:42:55] ok [21:43:07] hitting enwiki main page on an app server, still nothing [21:43:33] Krinkle: it should be, the alert was HHVM rendering which, evidently, alerts if it takes > 5 seconds [21:44:05] I did see a few eqiad alerts mixed in last time. Only 1 or 2, but they were extant. [21:44:42] Ah, okay, the incident report didn't quote any [21:44:47] That's good to know. 
[21:45:14] still dead quiet, 13 minutes since deploy [21:45:42] if it's a cache expiring somewhere it's possible that hitting expiry was somewhat random, so it might take longer this time around [21:45:54] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#3306077 (10pmiazga) [21:46:54] Indeed, WANObjectCache has randomized pre-emptive cache warmup as the TTL comes near. [21:46:58] so, the alert that exploded was: PROBLEM - HHVM rendering on [server] is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:47:19] right, and we saw that was mostly codfw [21:47:26] indeed [21:47:28] (03PS1) 10Chad: Configuring git-fat to work with Archiva [software/gerrit] - 10https://gerrit.wikimedia.org/r/356482 [21:48:06] gilles: it will talk in this channel if it's OK or CRIT, but not if it happens to be UNKNOWN [21:48:14] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306086 (10Krinkle) [21:49:08] mutante: hey, can I get the access back? [21:49:45] I'm home [21:50:21] (03PS5) 10Alexandros Kosiaris: Test the future parser in puppet compiler [puppet] - 10https://gerrit.wikimedia.org/r/322898 [21:50:23] (03PS1) 10Alexandros Kosiaris: motd::script: Don't use validate_re on an integer [puppet] - 10https://gerrit.wikimedia.org/r/356483 [21:50:36] Amir1: you told me something about food :) [21:50:52] (03PS1) 10Chad: Adding scap3 config [software/gerrit] - 10https://gerrit.wikimedia.org/r/356484 [21:50:54] Yeah about loving German beer [21:51:02] I can be in hangout if you want to [21:51:29] :D [21:51:37] hehe, that was a pass :) [21:51:48] sure, why not. 
10 second hangout [21:51:56] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown, 06Services (watching): wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3306101 (10Krinkle) Profiles captured just now, moments before the wmf.1>wmf.2 promotion o... [21:53:19] *crickets* [21:54:12] is it too early to pat the perf team on the back for fixing the perf issues in wmf.2? [21:54:36] and by fixing you mean doing absolutely nothing [21:54:44] thatsthejoke.jpg [21:55:13] hrm, I do see a few "request has exceeded memory limit" things in the log: https://logstash.wikimedia.org/goto/2778c2291bed7a181d00ff6438125fb1 [21:55:27] I think it's still too early to claim victory, who knows what caching duration we might be dealing with if it's an issue like that [21:56:39] (03PS1) 10Legoktm: ExtensionDistributor: Add REL1_29, drop REL1_23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 [21:56:48] gilles: https://phabricator.wikimedia.org/T166345#3305630 [21:56:55] Found what made the codfw servers time out [21:56:57] It is PoolCounter [21:57:00] AaronSchulz: ^ [21:57:03] is/was [21:57:54] page id 15580374 is the main page... :o [21:58:05] (03CR) 10Jforrester: [C: 031] "Should we drop 1_23 immediately and add 1_29 later once it's .0?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 (owner: 10Legoktm) [21:59:24] (03CR) 10Legoktm: "No, we can add it now so ED will start generating tarballs for 1.29 now since all the git branches exist, and it'll say something like "ne" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 (owner: 10Legoktm) [21:59:57] Krinkle: yeah, I mentioned that it seemed to keep being re-parsed (not just a one-off). That probably worsened the timeout rate. 
[22:01:16] AaronSchulz: After a bunch of entries timing out, there's eventually 1 slow-parse entry of 9 seconds [22:01:18] (updated comment) [22:01:21] it started before the IS touch, so we can rule that out [22:01:33] (03PS1) 10Chad: Add core plugins @ 2.13.8 [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 [22:01:35] Note these are from last week [22:01:45] I didn't think of looking at logstash for host: with the one from the alerts [22:01:49] before now [22:02:13] (03PS1) 10Dzahn: Revert "admins: revoke ladsgroups key temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/356489 [22:02:46] (03CR) 10Paladox: Add core plugins @ 2.13.8 (031 comment) [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (owner: 10Chad) [22:04:44] doesn't tell us what was triggering the reparse attempts, does it? [22:04:49] Nope [22:05:14] Well, ArticleView is the reason [22:06:46] Which means it was a regular page view where the cache just missed. [22:07:01] Hey, I asked in #wikimedia-dev who is maintaining Wikimedia blog technical parts and I was told that no one is doing it and for merging and deploying https://github.com/wikimedia/wikimediablog-wordpresscom/pull/7 I need to ask Ops [22:07:05] Most likely because it expired or was no longer valid. [22:07:13] the parser cache missed? [22:07:44] Yeah, like when an article is edited and then 30 days later you view it, it'll get reparsed on the fly. [22:08:01] Similarly if you view it in a user language that doesn't match the default. 
[22:08:15] (03PS2) 10Chad: Configuring git-fat to work with Archiva [software/gerrit] - 10https://gerrit.wikimedia.org/r/356482 [22:08:17] (03PS2) 10Chad: Adding scap3 config [software/gerrit] - 10https://gerrit.wikimedia.org/r/356484 [22:08:19] (03PS2) 10Chad: Add core + core plugins @ 2.13.8 [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 [22:08:32] Or when a template it used expired the parser cache for the main page, and the jobqueue hasn't reached it yet, then you'll get it reparsed on the fly [22:08:46] if that's the case the first run to grab the poolcounter lock would have parsed it and repopulated the cache. but the other calls minutes later keep timing out on the lock [22:08:56] Indeed. [22:09:14] And only one entry was logged for a successful parser run from that code path: for 9 seconds, several minutes later [22:09:18] which obviously wasn't the first one [22:09:34] as then it would've logged a monumental parser run of 5 minutes instead [22:09:53] should we nuke the parser cache entry for the main page and see what happens? :) [22:10:37] otherwise we might be waiting a while for it to occur on its own [22:11:01] if we're reasonably sure of this, is there a way to test on a smaller scale :) [22:11:39] well, presumably what we did yesterday would have triggered that, since we pointed memcache locally [22:11:53] and the first render of the main page did take time, but after that it was fine [22:12:03] I've just purged enwiki Main_Page via regular user interface on-wiki. [22:12:12] Users do this all the time. [22:12:15] PROBLEM - HHVM rendering on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:12:20] bingo [22:12:21] well there you go. [22:12:39] I saw it instantly show up on the app server I was watching as well [22:12:41] That... was less than 10 seconds. [22:12:59] yup, happening again [22:13:03] ok, lemme know when you've got all you need and I'll go ahead and roll back. [22:13:05] looks just like last time [22:13:10] Ok. 
capturing [22:13:39] in a way we got lucky that last time this happened "naturally" so shortly after the deployment [22:13:49] I'm done. [22:13:54] k, rolling back [22:14:05] PROBLEM - HHVM rendering on mw2190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:05] PROBLEM - HHVM rendering on mw2122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:05] PROBLEM - HHVM rendering on mw2108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:05] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:05] PROBLEM - HHVM rendering on mw2112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:06] PROBLEM - HHVM rendering on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:06] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 74365 bytes in 3.264 second response time [22:14:15] PROBLEM - HHVM rendering on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:15] PROBLEM - HHVM rendering on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:15] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:16] PROBLEM - HHVM rendering on mw2189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:16] PROBLEM - HHVM rendering on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:16] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:16] PROBLEM - HHVM rendering on mw2138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:28] codfw, don't be so sensitive [22:14:38] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: wikipedias back to 1.30.0-wmf.1 [22:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:29] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 74301 bytes in 0.122 second response time [22:15:29] 
RECOVERY - HHVM rendering on mw2225 is OK: HTTP OK: HTTP/1.1 200 OK - 74299 bytes in 0.099 second response time [22:15:29] RECOVERY - HHVM rendering on mw2183 is OK: HTTP OK: HTTP/1.1 200 OK - 74299 bytes in 0.090 second response time [22:15:29] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 74299 bytes in 0.096 second response time [22:15:29] RECOVERY - HHVM rendering on mw2099 is OK: HTTP OK: HTTP/1.1 200 OK - 74301 bytes in 0.121 second response time [22:15:35] RECOVERY - HHVM rendering on mw2203 is OK: HTTP OK: HTTP/1.1 200 OK - 74300 bytes in 0.106 second response time [22:15:35] RECOVERY - HHVM rendering on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 74300 bytes in 0.105 second response time [22:15:35] RECOVERY - HHVM rendering on mw2168 is OK: HTTP OK: HTTP/1.1 200 OK - 74299 bytes in 0.089 second response time [22:15:35] RECOVERY - HHVM rendering on mw2252 is OK: HTTP OK: HTTP/1.1 200 OK - 74300 bytes in 0.111 second response time [22:15:35] RECOVERY - HHVM rendering on mw2148 is OK: HTTP OK: HTTP/1.1 200 OK - 74299 bytes in 0.102 second response time [22:16:37] (03CR) 10Jforrester: [C: 031] "WFM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356486 (owner: 10Legoktm) [22:16:49] I do apologize, I sort of tongue-in-cheek expected that a simple purge wouldn't trigger it. But then again, probably simplest way to get it over with. [22:17:20] well it worked :) [22:18:19] 06Operations, 13Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3306157 (10Dzahn) [22:18:33] thcipriani: gilles: added profiles to https://phabricator.wikimedia.org/T166345#3306101 [22:18:35] problem is, if we don't capture the one actually trying to parse, we just hit something that waits on poolcounter, don't we? [22:18:39] I do have to leave now though, so I'll investigate tomorrow. [22:18:52] gilles: It's possible, indeed. [22:19:05] I'm hoping my purge/reload request got it, though. 
[22:19:28] It may not be the url I copied in the comment, but it'll have been with xhprof on either way, so at least it'll be among the other xhgui entries. [22:19:30] (03PS3) 10Chad: Configuring git-fat to work with Archiva [software/gerrit] - 10https://gerrit.wikimedia.org/r/356482 (https://phabricator.wikimedia.org/T157414) [22:19:31] (03PS3) 10Chad: Adding scap3 config [software/gerrit] - 10https://gerrit.wikimedia.org/r/356484 (https://phabricator.wikimedia.org/T157414) [22:19:33] (03PS3) 10Chad: Add core + core plugins @ 2.13.8 [software/gerrit] - 10https://gerrit.wikimedia.org/r/356488 (https://phabricator.wikimedia.org/T157414) [22:19:51] o/ [22:20:30] toodles, glad there is a lead on this issue now, seemingly :) [22:25:42] (03CR) 10Paladox: ">" (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 (owner: 10Paladox) [22:26:22] (03PS5) 10Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - 10https://gerrit.wikimedia.org/r/356480 [22:26:27] (03CR) 10Dzahn: [C: 032] "Amir is back home, we had a Hangout, he said the right code word, nobody took his laptop from him.. all confirmed." 
[puppet] - https://gerrit.wikimedia.org/r/356489 (owner: Dzahn) [22:26:37] (PS2) Dzahn: Revert "admins: revoke ladsgroups key temporarily" [puppet] - https://gerrit.wikimedia.org/r/356489 [22:33:59] (PS3) Herron: icinga: give permissions to run commands to herron [puppet] - https://gerrit.wikimedia.org/r/356309 (https://phabricator.wikimedia.org/T166587) (owner: Dzahn) [22:35:46] (CR) Herron: [C: 2] icinga: give permissions to run commands to herron [puppet] - https://gerrit.wikimedia.org/r/356309 (https://phabricator.wikimedia.org/T166587) (owner: Dzahn) [22:45:36] (PS1) Chad: Drop test/testrepo from trebuchet-deployed repos [puppet] - https://gerrit.wikimedia.org/r/356496 [22:50:08] (CR) Chad: "I'm pretty sure this isn't needed anymore since we're deploying via scap3 now (and the deployment.yaml entry has already disappeared)" [puppet] - https://gerrit.wikimedia.org/r/219253 (owner: GWicke) [22:54:43] (PS6) Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - https://gerrit.wikimedia.org/r/356480 [22:56:31] (CR) Paladox: "unlimited is not a correct value in systemd. infinity is the correct value." [debs/gerrit] - https://gerrit.wikimedia.org/r/356480 (owner: Paladox) [22:56:44] CUSTOM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:57:08] ^ test of Icinga permissions for Herron.. PASS :) [22:57:23] (Abandoned) Jdlrobson: Enable print styles for Minerva [mediawiki-config] - https://gerrit.wikimedia.org/r/353979 (https://phabricator.wikimedia.org/T163287) (owner: Jdlrobson) [22:57:47] PROBLEM - puppet last run on d-i-test is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [22:58:04] Operations, Patch-For-Review: Ops Onboarding for Keith Herron - https://phabricator.wikimedia.org/T166587#3306270 (Dzahn) [22:59:53] (PS7) Paladox: Gerrit: Set ulimit's in gerrit.service [debs/gerrit] - https://gerrit.wikimedia.org/r/356480 (https://phabricator.wikimedia.org/T158946) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T2300). Please do the needful. [23:02:46] Nothing to SWAT. [23:08:07] PROBLEM - HHVM rendering on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.075 second response time [23:08:07] PROBLEM - Nginx local proxy to apache on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.159 second response time [23:09:07] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 74256 bytes in 0.681 second response time [23:09:07] RECOVERY - Nginx local proxy to apache on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.263 second response time [23:19:48] (PS1) Chad: Gerrit: Add non-masters to have public DNS entries [puppet] - https://gerrit.wikimedia.org/r/356499 [23:20:46] (CR) jerkins-bot: [V: -1] Gerrit: Add non-masters to have public DNS entries [puppet] - https://gerrit.wikimedia.org/r/356499 (owner: Chad) [23:22:21] (PS2) Chad: Gerrit: Add non-masters to have public DNS entries [puppet] - https://gerrit.wikimedia.org/r/356499 [23:25:33] (PS3) Chad: Gerrit: Add non-masters to have public DNS entries [puppet] - https://gerrit.wikimedia.org/r/356499 [23:26:47] RECOVERY - puppet last run on d-i-test is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:28:13] (CR) Chad: "Compiles for both, only change
on master is adding of ServerAlias. https://puppet-compiler.wmflabs.org/6601/" [puppet] - https://gerrit.wikimedia.org/r/356499 (owner: Chad) [23:32:28] mutante: Your call if you wanna do that today or wait for tomorrow ^ [23:32:33] Should only really impact the slave [23:33:59] (PS1) Chico Venancio: New throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/356503 (https://phabricator.wikimedia.org/T166619) [23:37:19] ACKNOWLEDGEMENT - HP RAID on ms-be1020 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166746 [23:37:23] Operations, ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T166746#3306394 (ops-monitoring-bot) [23:46:09] hu ho [23:46:14] chicocvenancio: ping? [23:47:11] Thanks for submitting the change; it's indeed urgent to deploy that. [23:47:30] could you add it to the calendar at https://wikitech.wikimedia.org/wiki/Deployments please?
[23:49:53] (PS2) Dereckson: New throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/356503 (https://phabricator.wikimedia.org/T166619) (owner: Chico Venancio) [23:50:22] Dereckson, this is my first patch, let me see how to do that [23:50:58] jouncebot: now [23:50:58] For the next 0 hour(s) and 9 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170531T2300) [23:51:00] chicocvenancio: you'll find at https://wikitech.wikimedia.org/wiki/Deployments a calendar, with currently an "evening swat" slot [23:51:03] this one ^ [23:51:37] You can add there a line like this: [23:51:41] * [config] {{Gerrit|356503}} New throttle rule ({{phabT|166619}}) [23:51:58] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/356503 (https://phabricator.wikimedia.org/T166619) (owner: Chico Venancio) [23:53:38] (Merged) jenkins-bot: New throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/356503 (https://phabricator.wikimedia.org/T166619) (owner: Chico Venancio) [23:53:50] (CR) jenkins-bot: New throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/356503 (https://phabricator.wikimedia.org/T166619) (owner: Chico Venancio) [23:54:28] Thanks Dereckson, that was quick [23:54:52] Actually, my laptop was closed, and I got an update on my phone for this task, thanks to the high priority, so I logged in again. [23:56:21] live on mwdebug1002.eqiad.wmnet [23:59:24] !log dereckson@tin Synchronized wmf-config/throttle.php: Add throttule rules for 2017-06-01 Fortaleza event (T166619) (duration: 00m 41s) [23:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:34] T166619: Lift IP account creation limit for event in Fortaleza on 2017-06-01 - https://phabricator.wikimedia.org/T166619 [23:59:51] Here you are.