[00:00:32] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[00:07:27] 10Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests: LDAP access to the wmf group for Anne Gomez - https://phabricator.wikimedia.org/T170679#3450876 (10atgo) Just figured it out! Thanks @tbayer for the IRL help today.
[00:13:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:14:43] 10Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests: LDAP access to the wmf group for Anne Gomez - https://phabricator.wikimedia.org/T170679#3450883 (10Dzahn) cool:) then we don't need to reopen it.
[00:15:31] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3450884 (10Reedy) InterwikiSortOrders are set. `populateSitesTable.php` has been re-run... We had similar problems with atjwiki... I think this sugge...
[00:15:33] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:15:34] 10Operations, 10monitoring, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3450885 (10Dzahn)
[00:26:16] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: Decommission RCStream (rcs100[12]) - https://phabricator.wikimedia.org/T170157#3420970 (10Reedy) >>! In T170157#3443768, @Ottomata wrote: > Anyway, these hosts are off and ready for decom. Thanks! Umm.... ``...
[00:29:01] (03PS1) 10Reedy: Remove rcs100[12] from $wgRCFeeds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366175 (https://phabricator.wikimedia.org/T170157)
[00:30:52] (03CR) 10Reedy: [C: 032] Remove rcs100[12] from $wgRCFeeds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366175 (https://phabricator.wikimedia.org/T170157) (owner: 10Reedy)
[00:31:54] (03Merged) 10jenkins-bot: Remove rcs100[12] from $wgRCFeeds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366175 (https://phabricator.wikimedia.org/T170157) (owner: 10Reedy)
[00:32:11] (03CR) 10jenkins-bot: Remove rcs100[12] from $wgRCFeeds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366175 (https://phabricator.wikimedia.org/T170157) (owner: 10Reedy)
[00:34:01] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 3317
[00:35:01] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Remove rcs1001 and rcs1002 from CommonSettings wgRCFeeds. Stops a load of logspam T170157 (duration: 00m 48s)
[00:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:12] T170157: Decommission RCStream (rcs100[12]) - https://phabricator.wikimedia.org/T170157
[00:35:36] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests: Decommission RCStream (rcs100[12]) - https://phabricator.wikimedia.org/T170157#3450902 (10Reedy)
[00:41:17] 10Operations, 10Wikimedia-Stream: stream.wikimedia.org: Uneven distribution of client connections on backends - https://phabricator.wikimedia.org/T69957#3450924 (10Krinkle) 05Open>03declined Per T170157.
[00:42:10] 10Operations, 10Traffic, 10Wikimedia-Stream: stream.wikimedia.org - redirect http(s) to docs - https://phabricator.wikimedia.org/T70528#3450929 (10Krinkle) 05Open>03Resolved and all redirect to now.
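The 5xx alerts above come from a Graphite-backed Icinga check: it fetches recent datapoints for a request-rate metric and alerts on the share of points above a value, hence "44.44% of data above the critical threshold [1000.0]". Below is a minimal Python sketch of that logic; the endpoint, metric name and thresholds are illustrative assumptions, not the production check_graphite configuration.

```python
"""Rough sketch of a percent-over-threshold Graphite check, in the spirit of
the 'HTTP 5xx reqs/min' alerts above. URL, metric and thresholds are
placeholders, not the real check_graphite setup."""
import json
import urllib.request

GRAPHITE = "https://graphite.example.org/render"  # hypothetical endpoint
METRIC = "reqstats.5xx"                           # hypothetical metric name
CRITICAL_VALUE = 1000.0   # a datapoint "fails" if it exceeds this
CRITICAL_PERCENT = 20.0   # alert if more than this % of datapoints fail

def check(metric: str, minutes: int = 10) -> tuple[int, str]:
    url = f"{GRAPHITE}?target={metric}&from=-{minutes}min&format=json"
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)
    if not series:
        return 3, "UNKNOWN: no series returned"
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not points:
        return 3, "UNKNOWN: no datapoints"
    over = sum(1 for v in points if v > CRITICAL_VALUE)
    pct = 100.0 * over / len(points)
    if pct > CRITICAL_PERCENT:
        return 2, f"CRITICAL: {pct:.2f}% of data above the critical threshold [{CRITICAL_VALUE}]"
    return 0, f"OK: Less than {CRITICAL_PERCENT:.2f}% above the threshold [{CRITICAL_VALUE}]"

if __name__ == "__main__":
    code, message = check(METRIC)
    print(message)
    raise SystemExit(code)
```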
[00:42:30] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests: Decommission RCStream (rcs100[12]) - https://phabricator.wikimedia.org/T170157#3450931 (10RobH)
[00:47:22] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests: decommmission RCStream (rcs100[12]) - https://phabricator.wikimedia.org/T170157#3450935 (10RobH)
[00:47:34] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests: decommmission rcs100[12] - https://phabricator.wikimedia.org/T170157#3420970 (10RobH)
[00:48:54] 10Operations, 10hardware-requests: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3450938 (10Dzahn)
[00:49:16] 10Operations, 10hardware-requests, 10monitoring: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3450954 (10Dzahn)
[00:50:08] (03PS1) 10RobH: decom rcs100[12] [puppet] - 10https://gerrit.wikimedia.org/r/366176 (https://phabricator.wikimedia.org/T170157)
[00:52:07] (03CR) 10RobH: [C: 032] decom rcs100[12] [puppet] - 10https://gerrit.wikimedia.org/r/366176 (https://phabricator.wikimedia.org/T170157) (owner: 10RobH)
[00:54:40] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: decommission rcs100[12] - https://phabricator.wikimedia.org/T170157#3450991 (10Reedy)
[00:55:25] (03PS1) 10RobH: decom of rcs100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/366177 (https://phabricator.wikimedia.org/T170157)
[00:56:32] (03CR) 10RobH: [C: 032] decom of rcs100[12] production dns [dns] - 10https://gerrit.wikimedia.org/r/366177 (https://phabricator.wikimedia.org/T170157) (owner: 10RobH)
[00:58:52] (03PS1) 10Dzahn: netmon: remove librenms from netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/366178 (https://phabricator.wikimedia.org/T159756)
[00:59:23] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: decommission rcs100[12] - https://phabricator.wikimedia.org/T170157#3451006 (10RobH) a:05RobH>03Cmjohnson
[01:00:25] (03PS2) 10Dzahn: netmon: remove librenms from netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/366178 (https://phabricator.wikimedia.org/T159756)
[01:01:49] (03CR) 10Dzahn: [C: 032] netmon: remove librenms from netmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/366178 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn)
[01:02:51] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 9523
[01:06:00] (03PS1) 10Dzahn: netmon1001/site: fix name of 'spare' role class [puppet] - 10https://gerrit.wikimedia.org/r/366179
[01:06:29] (03PS2) 10Dzahn: netmon1001/site: fix name of 'spare' role class [puppet] - 10https://gerrit.wikimedia.org/r/366179
[01:10:11] (03CR) 10Dzahn: [C: 032] netmon1001/site: fix name of 'spare' role class [puppet] - 10https://gerrit.wikimedia.org/r/366179 (owner: 10Dzahn)
[01:10:51] (03PS1) 10Dzahn: prometheus: netmon1001 -> netmon1002 for snmp exporter [puppet] - 10https://gerrit.wikimedia.org/r/366180 (https://phabricator.wikimedia.org/T171018)
[01:11:47] (03PS2) 10Dzahn: prometheus: netmon1001 -> netmon1002 for snmp exporter [puppet] - 10https://gerrit.wikimedia.org/r/366180 (https://phabricator.wikimedia.org/T171018)
[01:14:00] (03CR) 10Dzahn: [C: 032] prometheus: netmon1001 -> netmon1002 for snmp exporter [puppet] - 10https://gerrit.wikimedia.org/r/366180 (https://phabricator.wikimedia.org/T171018) (owner: 10Dzahn)
[01:18:09] (03PS1) 10Dzahn: scap/dsh: replace netmon1001->netmont1002 for librenms [puppet] - 10https://gerrit.wikimedia.org/r/366181 (https://phabricator.wikimedia.org/T17018)
[01:18:30] 10Operations, 10monitoring, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3451092 (10Dzahn) https://gerrit.wikimedia.org/r/366180 - switch prometheus snmp-exporter , netmon1001 -> netmon1002 @fgiunchedi
[01:19:16] (03PS2) 10Dzahn: scap/dsh: replace netmon1001->netmont1002 for librenms [puppet] - 10https://gerrit.wikimedia.org/r/366181 (https://phabricator.wikimedia.org/T17018)
[01:21:06] (03CR) 10Dzahn: [C: 032] scap/dsh: replace netmon1001->netmont1002 for librenms [puppet] - 10https://gerrit.wikimedia.org/r/366181 (https://phabricator.wikimedia.org/T17018) (owner: 10Dzahn)
[01:21:35] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3451104 (10Koavf) Success: https://www.wikidata.org/w/index.php?title=Q9779&type=revision&diff=523421406&oldid=503816827
[01:22:37] 10Operations, 10hardware-requests: hardware request for netmon1001 replacement - https://phabricator.wikimedia.org/T156040#3451113 (10Dzahn)
[01:22:40] 10Operations, 10monitoring, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3451112 (10Dzahn) 05Open>03Resolved
[01:26:48] 10Operations, 10DBA: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3451129 (10MF-Warburg) Does "I would rather get mediawiki handle its deleting and all the related checks to be honest." mean you would like to have a function...
[01:27:58] !log netmon1001 - stopping all the services, killing snmpwalk, disarming keyholder
[01:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:36] 10Operations, 10hardware-requests, 10monitoring, 10Patch-For-Review: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3451135 (10Dzahn)
[01:30:25] 10Operations, 10hardware-requests, 10monitoring, 10Patch-For-Review: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3450938 (10Dzahn) p:05Triage>03Normal
[01:44:24] 10Operations, 10hardware-requests, 10monitoring, 10Patch-For-Review: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3451158 (10Dzahn) 05Open>03stalled @ayounsi netmon1001 has been replaced by netmon1002. It seemed you were the only user who actually had data in a home dir there. I tar.gz'...
[01:44:35] 10Operations, 10hardware-requests, 10monitoring, 10Patch-For-Review: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3451161 (10Dzahn) 05stalled>03Open
[01:51:01] 10Operations, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3451162 (10Dzahn) netmon2001 is up and running. netmon1002 and netmon2001 use identical roles: ``` 1791 node /^netmon(1002|2001)\.wikimedia\.org$/ { 1792 role(network::monitor, lib...
[01:55:10] 10Operations, 10DBA: Evaluate how hard would be to get aa(wikibooks|wiktionary) and howiki databases deleted - https://phabricator.wikimedia.org/T169928#3413499 (10Legoktm) What exactly is the purpose of deleting wikis? It provides no benefit and is more likely to break things.
[02:00:04] (03PS1) 10Dzahn: Revert "rancid: switch active server to netmon2001" [puppet] - 10https://gerrit.wikimedia.org/r/366182
[02:01:22] (03PS2) 10Dzahn: Revert "rancid: switch active server to netmon2001" [puppet] - 10https://gerrit.wikimedia.org/r/366182
[03:03:36] !log l10nupdate@tin LocalisationUpdate failed: git pull of extensions failed
[03:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:30] (03PS3) 10Dzahn: Revert "rancid: switch active server to netmon2001" [puppet] - 10https://gerrit.wikimedia.org/r/366182
[03:16:41] (03CR) 10Dzahn: [C: 032] Revert "rancid: switch active server to netmon2001" [puppet] - 10https://gerrit.wikimedia.org/r/366182 (owner: 10Dzahn)
[03:26:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 671.65 seconds
[03:28:55] (03PS1) 10Dzahn: rancid: change the rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/366185 (https://phabricator.wikimedia.org/T159756)
[03:29:44] (03PS2) 10Dzahn: rancid: change the rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/366185 (https://phabricator.wikimedia.org/T159756)
[03:33:32] (03CR) 10Dzahn: [C: 032] rancid: change the rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/366185 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn)
[03:36:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 265 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[03:41:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 6 probes of 265 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[03:51:02] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 148.19 seconds
[03:53:01] bawolff: btw, the gpg signature for the email you sent to mediawiki-l just now came back bad
[03:53:23] lol, this is what I get for trying to use gpg
[03:53:32] I probably screwed it up somehow, the email really is from me
[03:53:39] I figured as much :P
[03:53:46] Your email client may have munged something
[03:54:06] all the cool kids use gpg. I just wanted to be cool :P
[03:54:45] yeah, it's a shame that it's so difficult to use properly :P
[03:55:05] The only way I've been able to get gpg signatures to resolve correctly is to use Thunderbird with Enigmail
[03:55:17] At least this proves that not everyone uses the xkcd method of validating pgp signatures
[03:55:19] Every other method I've tried has munged stuff
[03:55:25] hehe
[04:00:36] yeah, it looks like it adjusted the line lengths
[04:04:10] otoh, DKIM signatures are now on emails, and trusting the domain/server security is probably a saner trust model than some randoms somewhere vouched for you
[04:10:01] oh shit i did it again
[04:10:14] I tried to edit the text after i signed it, without thinking
[04:11:15] I'm just going to stop posting
[04:24:22] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:22] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:32] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:32] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:32] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
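The dbstore alerts here are NRPE-driven replication checks: one watches lag ("slave_sql_lag Replication lag: 671.65 seconds"), the others whether the Slave_IO/Slave_SQL threads are running. A rough Python sketch of such a check follows, assuming PyMySQL and placeholder host, credentials and thresholds; the production check script differs.

```python
"""Sketch of a replication-health check along the lines of the dbstore alerts
above (illustrative only; host/credentials/thresholds are placeholders)."""
import sys
import pymysql  # assumes PyMySQL is installed

WARN_SECONDS = 300
CRIT_SECONDS = 600

def check_slave(host: str) -> tuple[int, str]:
    conn = pymysql.connect(host=host, user="nagios", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    if not status:
        return 0, "OK slave_sql_state not a slave"
    if status["Slave_IO_Running"] != "Yes" or status["Slave_SQL_Running"] != "Yes":
        return 2, "CRITICAL replication threads not running"
    lag = status["Seconds_Behind_Master"]  # None when the SQL thread is stopped
    if lag is None or lag >= CRIT_SECONDS:
        return 2, f"CRITICAL slave_sql_lag Replication lag: {lag} seconds"
    if lag >= WARN_SECONDS:
        return 1, f"WARNING slave_sql_lag Replication lag: {lag} seconds"
    return 0, f"OK slave_sql_lag Replication lag: {lag} seconds"

if __name__ == "__main__":
    code, message = check_slave(sys.argv[1] if len(sys.argv) > 1 else "localhost")
    print(message)
    sys.exit(code)
```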
[04:24:33] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:42] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:43] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:43] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:43] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:43] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:52] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:52] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:52] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:24:53] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:25:02] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:25:02] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:25:12] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:30:53] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[04:31:02] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:31:12] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[04:31:12] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[04:31:22] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:31:22] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:31:22] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:31:32] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:31:32] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:31:42] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:31:42] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:31:42] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:31:42] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:31:42] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:31:43] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:31:43] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:31:52] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:31:52] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:09:35] FastLizard4: well, time to see if third time is the charm or not
[05:10:18] I think i did it this time :)
[05:11:46] iPGMail on my phone likes it
[05:11:48] And...
[05:12:09] Enigmail/Thunderbird also approves
[05:12:16] gg bawolff :P
[05:12:51] Of course I look a little like an idiot being paranoid and only doing about 40 characters per line
[05:12:55] maybe i should just set up a real email client instead of using gmail web interface
[05:13:36] Oh yeah, Gmail's Web doesn't work well with gpg
[05:14:18] Thunderbird with the Enigmail extension is what I've seen most often
[05:15:46] Next step of course is to get more people than just harej to sign my key associated with my wmf email
[05:16:12] RECOVERY - Check systemd state on mw2118 is OK: OK - running: The system is fully operational
[05:19:26] I wouldn't worry too much about getting sigs
[05:19:44] Just put a copy of the public key or the fingerprint on your userpage or something
[05:19:52] yeah its there
[05:20:31] If you want, I can provide a sig
[05:20:42] I'm reasonably sure you are who you claim ro be
[05:20:43] *to
[05:20:45] :P
[05:20:59] <_joe_> I wouldn't assume "reasonably" is enough to sign a key
[05:21:19] <_joe_> you always need in-person or at least visual confirmation
[05:21:19] (but then again, you may only want @wikimedia.org keys on your sig)
[05:21:45] I'll probably try to convince more people to sign my key at wikimania
[05:22:21] in an ideal world where you personally know other gpg users, sure
[05:23:10] in practice, it just makes establishing a useful web of trust nigh impossible
[05:25:15] It usually takes at least three signatures for a key to become automatically trusted, and unless you're at a university or in a particularly techy company, good luck finding three gpg users, let alone one
[05:25:43] RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational
[05:26:02] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational
[05:26:02] RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational
[05:26:02] RECOVERY - Check systemd state on mw2243 is OK: OK - running: The system is fully operational
[05:26:22] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational
[05:26:27] <_joe_> !log ran systemctl reset-failed on codfw jobrunners after the jobrunner process was activated by mistake running scap at 21.20 UTC yesterday
[05:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:10] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3451317 (10Marostegui) 05Open>03Resolved Thanks @Papaul I have started MySQL and the replication thread is catching up. Going to close this as resolved, and if it crashes again, let's reopen it.
[05:38:32] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:57:27] (03Abandoned) 10Krinkle: Add MediaWikiInstallPingback to EL purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/366049 (https://phabricator.wikimedia.org/T170986) (owner: 10Mforns)
[06:00:12] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
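The signature failures discussed above (03:53 and 05:09) came down to the mail client re-wrapping the message after signing; any change to the text between the clearsign markers invalidates the signature. A small sketch of that effect using the gpg CLI through subprocess, assuming gpg is on PATH and a usable default secret key (or agent-cached passphrase) is available:

```python
"""Shows why the re-wrapped mails above failed verification: a clearsigned
message no longer verifies once its text changes. Assumes `gpg` is installed
and a default secret key can sign non-interactively."""
import subprocess

def clearsign(text: str) -> str:
    # gpg --clearsign reads the message on stdin and writes the signed form to stdout
    return subprocess.run(["gpg", "--clearsign"], input=text,
                          capture_output=True, text=True, check=True).stdout

def verify(signed: str) -> bool:
    # gpg --verify with no file arguments reads the signed data from stdin
    result = subprocess.run(["gpg", "--verify"], input=signed,
                            capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    signed = clearsign("all the cool kids use gpg\n")
    print("intact message verifies:    ", verify(signed))
    # Simulate an email client "adjusting the line lengths" after signing.
    munged = signed.replace("all the cool kids use gpg",
                            "all the cool kids\nuse gpg")
    print("re-wrapped message verifies:", verify(munged))
```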
[06:03:22] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=496.60 Read Requests/Sec=913.60 Write Requests/Sec=2.30 KBytes Read/Sec=55213.60 KBytes_Written/Sec=59.60
[06:12:32] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.80 Read Requests/Sec=4.20 Write Requests/Sec=21.10 KBytes Read/Sec=25.20 KBytes_Written/Sec=260.00
[06:28:13] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:37:11] (03PS1) 10Ayounsi: Depool codfw for asw-c-codfw upgrade [dns] - 10https://gerrit.wikimedia.org/r/366198 (https://phabricator.wikimedia.org/T170380)
[06:39:37] (03PS1) 10Ayounsi: Route traffic around codfw for asw-c-codfw upgrade [puppet] - 10https://gerrit.wikimedia.org/r/366199 (https://phabricator.wikimedia.org/T170380)
[07:28:16] (03PS2) 10Muehlenhoff: Remove specific version annotation for nginx [puppet] - 10https://gerrit.wikimedia.org/r/365650
[07:29:02] PROBLEM - Host oresrdb2002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:30:19] hmm, I can't even connect to the mgmt of oresrdb2002
[07:30:54] but it's only the passive host in codfw
[07:31:02] RECOVERY - Host oresrdb2002 is UP: PING OK - Packet loss = 0%, RTA = 36.03 ms
[07:32:54] (03CR) 10Ema: [C: 031] Route traffic around codfw for asw-c-codfw upgrade [puppet] - 10https://gerrit.wikimedia.org/r/366199 (https://phabricator.wikimedia.org/T170380) (owner: 10Ayounsi)
[07:33:14] (03CR) 10Ema: [C: 031] Depool codfw for asw-c-codfw upgrade [dns] - 10https://gerrit.wikimedia.org/r/366198 (https://phabricator.wikimedia.org/T170380) (owner: 10Ayounsi)
[07:33:54] (03CR) 10Ayounsi: [C: 032] Depool codfw for asw-c-codfw upgrade [dns] - 10https://gerrit.wikimedia.org/r/366198 (https://phabricator.wikimedia.org/T170380) (owner: 10Ayounsi)
[07:33:58] (03CR) 10Ayounsi: [C: 032] Route traffic around codfw for asw-c-codfw upgrade [puppet] - 10https://gerrit.wikimedia.org/r/366199 (https://phabricator.wikimedia.org/T170380) (owner: 10Ayounsi)
[07:49:05] !log oblivian@neodymium conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=(restbase-async|citoid)
[07:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:03] PROBLEM - MD RAID on cp1008 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0
[07:56:03] ACKNOWLEDGEMENT - MD RAID on cp1008 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T171028
[07:56:08] 10Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3451439 (10ops-monitoring-bot)
[07:57:38] !log Drop migrateuser_medium from s7 - T170310
[07:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:10] (03PS1) 10Alexandros Kosiaris: Revert "switch librenms from netmon1001 to netmon1002" [dns] - 10https://gerrit.wikimedia.org/r/366202
[08:01:59] (03CR) 10Alexandros Kosiaris: "This was reverted in https://gerrit.wikimedia.org/r/#/c/366202/ as netmon1002 did not have all the old RRD data and it was needed for row-" [dns] - 10https://gerrit.wikimedia.org/r/364617 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn)
[08:02:13] (03PS2) 10Alexandros Kosiaris: Revert "switch librenms from netmon1001 to netmon1002" [dns] - 10https://gerrit.wikimedia.org/r/366202
[08:02:17] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "switch librenms from netmon1001 to netmon1002" [dns] - 10https://gerrit.wikimedia.org/r/366202 (owner: 10Alexandros Kosiaris)
[08:02:37] (03CR) 10Filippo Giunchedi: "> re: the mailman I/O icinga alerts, they always fire off when there" [puppet] - 10https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: 10Herron)
[08:04:45] 10Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3451439 (10ema) Looks like `sda` is dead: ``` [Wed Jul 19 07:51:28 2017] mptscsih: ioc0: attempting task abort! (sc=ffff9f1e85c40780) [Wed Jul 19 07:51:28 2017] sd 0:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 00 c8 56...
[08:05:13] 10Operations, 10ops-eqiad: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3451459 (10ema) p:05Triage>03Normal
[08:05:26] 10Operations, 10ops-eqiad, 10Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3451439 (10ema)
[08:09:19] !log disable librenms crons on netmon1002 for a while
[08:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:33] PROBLEM - Apache HTTP on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:42] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.914 second response time
[08:22:52] PROBLEM - Apache HTTP on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:53] PROBLEM - Apache HTTP on mw2138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:53] PROBLEM - Nginx local proxy to apache on mw2110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:53] PROBLEM - Apache HTTP on mw2117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:53] PROBLEM - Apache HTTP on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:53] PROBLEM - HHVM rendering on mw2116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:53] PROBLEM - Nginx local proxy to apache on mw2116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:54] PROBLEM - Nginx local proxy to apache on mw2106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:54] PROBLEM - Nginx local proxy to apache on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:55] PROBLEM - Nginx local proxy to apache on mw2120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:55] PROBLEM - configured eth on lvs2003 is CRITICAL: eth2 reporting no carrier.
[08:22:56] PROBLEM - Nginx local proxy to apache on mw2240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:08] PROBLEM - Apache HTTP on mw2112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:08] PROBLEM - Nginx local proxy to apache on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:09] PROBLEM - configured eth on lvs2006 is CRITICAL: eth2 reporting no carrier.
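The "MD RAID ... State: degraded" alert and its auto-acknowledgement above come from a software-RAID health check plus a handler that files a Phabricator task. A hedged sketch of the detection half, reading /proc/mdstat (illustrative only, not the production RAID handler):

```python
"""Detects degraded Linux md arrays by parsing /proc/mdstat, in the spirit of
the 'MD RAID on cp1008' alert above (not the production check/handler)."""
import re
import sys

def degraded_arrays(mdstat_path: str = "/proc/mdstat") -> list[str]:
    degraded, current = [], None
    with open(mdstat_path) as f:
        for line in f:
            if line.startswith("md"):
                current = line.split()[0]  # e.g. "md0"
            # status lines look like "... [2/1] [U_]": an underscore marks a failed member
            match = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
            if current and match and "_" in match.group(1):
                degraded.append(current)
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print(f"CRITICAL: degraded arrays: {', '.join(bad)}")
        sys.exit(2)
    print("OK: all md arrays healthy")
    sys.exit(0)
```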
[08:23:12] PROBLEM - HHVM rendering on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:12] PROBLEM - Apache HTTP on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:12] PROBLEM - HHVM rendering on mw2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:12] PROBLEM - HHVM rendering on mw2128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:17] PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.1.30, port=9200): Read timed out. (read timeout=4) [08:23:17] PROBLEM - HHVM rendering on mw2100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:17] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2111.codfw.wmnet because of too many down!: api-https_443 - Could not depool server mw2129.codfw.wmnet because of too many down!: appservers-https_443 - Could not depool server mw2241.codfw.wmnet because of too many down!: rendering_80 - Could not depool server mw2151.codfw.wmnet because of too many down!: rende [08:23:17] uld not depool server mw2150.codfw.wmnet because of too many down!: api_80 - Could not depool server mw2137.codfw.wmnet because of too many down! [08:23:17] PROBLEM - HHVM rendering on mw2099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:22] PROBLEM - HHVM rendering on mw2131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:26] PROBLEM - LVS HTTP IPv4 on rendering.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.21 and port 80: No route to host [08:23:26] PROBLEM - Nginx local proxy to apache on mw2253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:27] PROBLEM - Apache HTTP on mw2257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:27] PROBLEM - Nginx local proxy to apache on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:27] PROBLEM - HHVM rendering on mw2135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:36] PROBLEM - LVS HTTP IPv4 on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:36] PROBLEM - Nginx local proxy to apache on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:42] PROBLEM - Apache HTTP on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:42] PROBLEM - HHVM rendering on mw2117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:42] PROBLEM - configured eth on lvs2002 is CRITICAL: eth2 reporting no carrier. [08:23:42] PROBLEM - HHVM rendering on mw2103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:42] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2112.codfw.wmnet because of too many down!: api-https_443 - Could not depool server mw2122.codfw.wmnet because of too many down!: api_80 - Could not depool server mw2122.codfw.wmnet because of too many down!: rendering_80 - Could not depool server mw2151.codfw.wmnet because of too many down!: rendering-https_443 [08:23:43] server mw2150.codfw.wmnet because of too many down!: appservers-https_443 - Could not depool server mw2097.codfw.wmnet because of too many down! [08:23:52] PROBLEM - configured eth on lvs2005 is CRITICAL: eth2 reporting no carrier. 
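The "Could not depool server ... because of too many down!" lines above reflect PyBal's depool threshold: once too few backends would remain pooled, further failing servers are kept in rotation instead of being depooled. A simplified Python model of that decision follows; it is an illustration of the idea, not PyBal's actual code, and the depool_threshold fraction is an assumed parameter.

```python
"""Simplified model of the 'Could not depool ... too many down' behaviour:
a failed server is only depooled while enough of the pool stays up."""

class Pool:
    def __init__(self, servers: list[str], depool_threshold: float = 0.5):
        # depool_threshold: minimum fraction of servers that must stay pooled
        self.pooled = set(servers)
        self.total = len(servers)
        self.depool_threshold = depool_threshold

    def can_depool(self) -> bool:
        min_pooled = self.total * self.depool_threshold
        return len(self.pooled) - 1 >= min_pooled

    def handle_monitor_down(self, server: str) -> str:
        if server not in self.pooled:
            return f"{server} already depooled"
        if not self.can_depool():
            return f"Could not depool server {server} because of too many down!"
        self.pooled.remove(server)
        return f"depooled {server}"

if __name__ == "__main__":
    pool = Pool([f"mw{n}.codfw.wmnet" for n in range(2100, 2104)])
    # With 4 backends and a 0.5 threshold, only the first two failures get depooled.
    for backend in sorted(pool.pooled):
        print(pool.handle_monitor_down(backend))
```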
[08:23:52] PROBLEM - Nginx local proxy to apache on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:52] PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:02] PROBLEM - Apache HTTP on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:06] PROBLEM - LVS HTTPS IPv4 on rendering.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.21 and port 443: No route to host [08:24:06] PROBLEM - Apache HTTP on mw2137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:07] PROBLEM - Apache HTTP on mw2127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:07] PROBLEM - Apache HTTP on mw2099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:07] RECOVERY - Nginx local proxy to apache on mw2227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.186 second response time [08:24:07] RECOVERY - Nginx local proxy to apache on mw2110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.201 second response time [08:24:07] RECOVERY - Apache HTTP on mw2114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.381 second response time [08:24:08] RECOVERY - Nginx local proxy to apache on mw2120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.459 second response time [08:24:08] RECOVERY - Nginx local proxy to apache on mw2105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.197 second response time [08:24:09] RECOVERY - Nginx local proxy to apache on mw2116 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.284 second response time [08:24:09] RECOVERY - HHVM rendering on mw2116 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 2.531 second response time [08:24:10] RECOVERY - Nginx local proxy to apache on mw2106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.457 second response time [08:24:22] RECOVERY - Apache HTTP on mw2112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.377 second response time [08:24:22] RECOVERY - Apache HTTP on mw2130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.368 second response time [08:24:22] RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 2.530 second response time [08:24:23] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 2.294 second response time [08:24:23] PROBLEM - Apache HTTP on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:32] PROBLEM - Nginx local proxy to apache on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:42] RECOVERY - HHVM rendering on mw2100 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 9.304 second response time [08:24:42] PROBLEM - Apache HTTP on mw2233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:42] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 76004 bytes in 0.254 second response time [08:24:42] RECOVERY - Nginx local proxy to apache on mw2253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.190 second response time [08:24:46] RECOVERY - LVS HTTP IPv4 on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 23115 bytes in 0.366 second response time [08:24:46] RECOVERY - Apache HTTP on mw2257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.369 second response time [08:24:46] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:47] RECOVERY - HHVM rendering on mw2131 is OK: HTTP OK: 
HTTP/1.1 200 OK - 76007 bytes in 9.305 second response time [08:24:52] RECOVERY - Nginx local proxy to apache on mw2113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 9.227 second response time [08:24:52] PROBLEM - HHVM rendering on mw2234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:52] RECOVERY - Apache HTTP on mw2227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.110 second response time [08:24:52] PROBLEM - Apache HTTP on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:52] PROBLEM - Apache HTTP on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:24:53] RECOVERY - HHVM rendering on mw2103 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.273 second response time [08:24:54] RECOVERY - HHVM rendering on mw2117 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 2.528 second response time [08:25:02] RECOVERY - Nginx local proxy to apache on mw2124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.200 second response time [08:25:02] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.266 second response time [08:25:03] PROBLEM - Apache HTTP on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:03] RECOVERY - Apache HTTP on mw2099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.121 second response time [08:25:03] PROBLEM - Nginx local proxy to apache on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:03] PROBLEM - Nginx local proxy to apache on mw2241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:03] RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.616 second response time [08:25:12] PROBLEM - Apache HTTP on mw2126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:12] RECOVERY - Nginx local proxy to apache on mw2129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.197 second response time [08:25:12] RECOVERY - HHVM rendering on mw2115 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.266 second response time [08:25:12] PROBLEM - HHVM rendering on mw2217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:22] RECOVERY - HHVM rendering on mw2238 is OK: HTTP OK: HTTP/1.1 200 OK - 76004 bytes in 0.252 second response time [08:25:22] RECOVERY - Apache HTTP on mw2235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.113 second response time [08:25:22] RECOVERY - Apache HTTP on mw2251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.109 second response time [08:25:22] RECOVERY - Nginx local proxy to apache on mw2235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.188 second response time [08:25:22] RECOVERY - Apache HTTP on mw2258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.109 second response time [08:25:23] PROBLEM - Nginx local proxy to apache on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:32] RECOVERY - Nginx local proxy to apache on mw2236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.188 second response time [08:25:32] RECOVERY - HHVM rendering on mw2240 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 9.279 second response time [08:25:32] PROBLEM - HHVM rendering on mw2109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:43] RECOVERY - Apache HTTP on mw2233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.369 second response time [08:25:44] PROBLEM - HHVM 
rendering on mw2233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:44] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.506 second response time [08:25:44] PROBLEM - Nginx local proxy to apache on mw2238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:44] PROBLEM - Nginx local proxy to apache on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:52] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:52] RECOVERY - Apache HTTP on mw2232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.110 second response time [08:25:53] RECOVERY - HHVM rendering on mw2234 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 2.514 second response time [08:26:03] PROBLEM - HHVM rendering on mw2107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:03] PROBLEM - HHVM rendering on mw2139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:12] PROBLEM - Apache HTTP on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:13] RECOVERY - Apache HTTP on mw2127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.378 second response time [08:26:13] PROBLEM - Apache HTTP on mw2100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:13] PROBLEM - Nginx local proxy to apache on mw2143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:13] PROBLEM - HHVM rendering on mw2108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:22] PROBLEM - Apache HTTP on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:22] PROBLEM - HHVM rendering on mw2253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:23] RECOVERY - HHVM rendering on mw2224 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 9.278 second response time [08:26:32] PROBLEM - Nginx local proxy to apache on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:32] PROBLEM - Apache HTTP on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:42] PROBLEM - Nginx local proxy to apache on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:42] RECOVERY - Nginx local proxy to apache on mw2238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.189 second response time [08:26:42] RECOVERY - Nginx local proxy to apache on mw2133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 9.940 second response time [08:26:52] RECOVERY - Nginx local proxy to apache on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 9.210 second response time [08:26:52] PROBLEM - Nginx local proxy to apache on mw2139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:26:52] PROBLEM - Nginx local proxy to apache on mw2145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:03] RECOVERY - Nginx local proxy to apache on mw2241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.197 second response time [08:27:12] RECOVERY - HHVM rendering on mw2107 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 9.298 second response time [08:27:12] RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 9.290 second response time [08:27:12] RECOVERY - Apache HTTP on mw2100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.121 second response time [08:27:12] RECOVERY - HHVM rendering on mw2108 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.264 second response time [08:27:12] PROBLEM - HHVM 
rendering on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:13] RECOVERY - HHVM rendering on mw2217 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 2.514 second response time [08:27:22] PROBLEM - Nginx local proxy to apache on mw2121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:22] PROBLEM - Apache HTTP on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:22] RECOVERY - Nginx local proxy to apache on mw2143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 9.225 second response time [08:27:26] PROBLEM - LVS HTTP IPv4 on search.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.169 second response time [08:27:32] PROBLEM - Nginx local proxy to apache on mw2110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:32] PROBLEM - Nginx local proxy to apache on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:32] PROBLEM - Nginx local proxy to apache on mw2116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:32] PROBLEM - HHVM rendering on mw2116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:32] PROBLEM - Nginx local proxy to apache on mw2105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:33] RECOVERY - Nginx local proxy to apache on mw2114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.197 second response time [08:27:33] PROBLEM - HHVM rendering on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:34] PROBLEM - Nginx local proxy to apache on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:34] RECOVERY - Apache HTTP on mw2113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.146 second response time [08:27:35] PROBLEM - HHVM rendering on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:42] RECOVERY - HHVM rendering on mw2109 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.266 second response time [08:27:43] PROBLEM - Apache HTTP on mw2097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:43] RECOVERY - Apache HTTP on mw2133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.372 second response time [08:27:52] PROBLEM - Nginx local proxy to apache on mw2100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:53] PROBLEM - HHVM rendering on mw2128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:27:53] RECOVERY - Nginx local proxy to apache on mw2145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.196 second response time [08:27:53] PROBLEM - HHVM rendering on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:02] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:02] RECOVERY - Nginx local proxy to apache on mw2139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 9.214 second response time [08:28:03] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:03] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:03] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:03] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2018_v4, cp2018_v6 [08:28:03] PROBLEM - IPsec on cp4008 is 
CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:04] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:04] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:05] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:05] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:12] RECOVERY - Nginx local proxy to apache on mw2258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 1.185 second response time [08:28:12] PROBLEM - Nginx local proxy to apache on mw2253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:12] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:12] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:12] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:13] PROBLEM - HHVM rendering on mw2103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:22] RECOVERY - Apache HTTP on mw2144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.379 second response time [08:28:22] PROBLEM - Apache HTTP on mw2228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:22] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:22] RECOVERY - Apache HTTP on mw2125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.149 second response time [08:28:23] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:23] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:23] RECOVERY - Nginx local proxy to apache on mw2222 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 9.217 second response time [08:28:24] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:24] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:25] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:32] RECOVERY - Apache HTTP on mw2218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.368 second response time [08:28:32] RECOVERY - Nginx local proxy to apache on mw2121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 9.228 second response time [08:28:32] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:32] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:32] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:33] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:33] PROBLEM - IPsec on 
rdb1005 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: rdb2005_v4 [08:28:34] PROBLEM - IPsec on mc1030 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2030_v4 [08:28:37] PROBLEM - IPsec on mc1027 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2027_v4 [08:28:37] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:37] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:37] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:37] PROBLEM - IPsec on mc1029 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2029_v4 [08:28:37] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2018_v4, cp2018_v6 [08:28:49] !log asw-c-codfw restarted 8min ago for switch upgrade - T170380 [08:28:52] RECOVERY - Nginx local proxy to apache on mw2100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 2.386 second response time [08:28:52] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:52] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 2.537 second response time [08:28:52] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:53] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:28:53] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:53] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:28:54] PROBLEM - Apache HTTP on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:28:54] RECOVERY - Nginx local proxy to apache on mw2146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 9.220 second response time [08:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:00] T170380: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380 [08:29:02] PROBLEM - IPsec on mc1028 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2028_v4 [08:29:02] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:29:02] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:29:02] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:29:02] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:29:03] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:29:03] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 9.275 second response time [08:29:04] PROBLEM - Nginx local proxy to apache on mw2228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:29:04] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] [08:29:05] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 40 
connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:29:05] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:29:06] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:29:06] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:29:07] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:29:12] PROBLEM - IPsec on rdb1007 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: rdb2005_v4 [08:29:12] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2018_v4, cp2018_v6 [08:29:12] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:29:12] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2014_v4, cp2014_v6, cp2017_v4, cp2017_v6 [08:29:12] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:29:13] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6 [08:29:24] RECOVERY - Apache HTTP on mw2126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.113 second response time [08:29:25] RECOVERY - Apache HTTP on mw2146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.284 second response time [08:29:25] RECOVERY - HHVM rendering on mw2113 is OK: HTTP OK: HTTP/1.1 200 OK - 76007 bytes in 6.294 second response time [08:29:32] RECOVERY - configured eth on lvs2005 is OK: OK - interfaces up [08:29:32] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 58 ESP OK [08:29:32] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [08:29:33] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [08:29:33] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [08:29:33] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [08:29:33] RECOVERY - Apache HTTP on mw2137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.117 second response time [08:29:34] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 58 ESP OK [08:29:37] RECOVERY - LVS HTTP IPv4 on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.164 second response time [08:29:37] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK [08:29:37] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [08:29:37] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [08:29:38] RECOVERY - IPsec on rdb1005 is OK: Strongswan OK - 1 ESP OK [08:29:38] RECOVERY - IPsec on mc1030 is OK: Strongswan OK - 1 ESP OK [08:29:38] RECOVERY - IPsec on mc1027 is OK: Strongswan OK - 1 ESP OK [08:29:38] !log asw-c-codfw back online - T170380 [08:29:39] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [08:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:52] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [08:29:52] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [08:29:52] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [08:29:56] RECOVERY - LVS HTTPS IPv4 on rendering.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 16149 bytes in 0.282 second response time [08:29:56] RECOVERY - HHVM rendering on mw2240 is OK: HTTP OK: HTTP/1.1 
200 OK - 76005 bytes in 0.270 second response time [08:29:56] RECOVERY - Apache HTTP on mw2138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.145 second response time [08:29:56] RECOVERY - configured eth on lvs2004 is OK: OK - interfaces up [08:29:58] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [08:29:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:29:58] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [08:29:58] RECOVERY - Apache HTTP on mw2221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.110 second response time [08:29:59] RECOVERY - configured eth on lvs2003 is OK: OK - interfaces up [08:30:12] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:30:12] RECOVERY - configured eth on lvs2001 is OK: OK - interfaces up [08:30:12] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [08:30:12] RECOVERY - HHVM rendering on mw2125 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.302 second response time [08:30:12] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [08:30:13] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [08:30:13] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [08:30:14] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [08:30:14] RECOVERY - Nginx local proxy to apache on mw2228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.188 second response time [08:30:15] RECOVERY - configured eth on lvs2006 is OK: OK - interfaces up [08:30:15] RECOVERY - IPsec on mc1028 is OK: Strongswan OK - 1 ESP OK [08:30:16] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.283 second response time [08:30:16] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [08:30:17] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [08:30:18] RECOVERY - ElasticSearch health check for shards on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 36, unassigned_shards: 168, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3087, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_number: 9 [08:30:18] e_shards: 9050, initializing_shards: 37, number_of_data_nodes: 36, delayed_unassigned_shards: 0 [08:30:22] RECOVERY - HHVM rendering on mw2099 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.293 second response time [08:30:22] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [08:30:22] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [08:30:22] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [08:30:32] RECOVERY - Nginx local proxy to apache on mw2253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.186 second response time [08:30:32] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 76005 bytes in 0.289 second response time [08:30:32] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 44 ESP OK [08:30:32] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK [08:30:32] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK [08:30:33] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 44 ESP OK [08:30:33] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [08:30:34] RECOVERY - 
IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [08:30:38] RECOVERY - IPsec on rdb1007 is OK: Strongswan OK - 1 ESP OK [08:30:38] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [08:30:38] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 58 ESP OK [08:30:38] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 58 ESP OK [08:30:38] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK [08:30:38] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [08:30:52] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [08:31:32] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] [08:32:02] PROBLEM - puppet last run on elastic2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:02] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:02] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:02] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:03] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:03] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:12] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:12] PROBLEM - puppet last run on mw2181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:12] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:12] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:32:12] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:22] PROBLEM - puppet last run on mw2175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:22] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:22] PROBLEM - puppet last run on thumbor2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:22] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:22] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:23] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:32:23] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:24] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:32:24] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [08:32:25] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:25] PROBLEM - salt-minion processes on mw2202 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [08:32:27] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [08:32:27] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:32] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:32:32] PROBLEM - puppet last run on db2077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:32] PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:33] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:32:33] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:33] PROBLEM - puppet last run on mc2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:42] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:32:42] PROBLEM - puppet last run on oresrdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:42] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:42] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [08:32:43] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:32:43] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:43] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:52] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:52] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:52] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:52] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:52] PROBLEM - puppet last run on ores2006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:32:53] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:32:53] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [08:32:54] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:33:02] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:33:02] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:33:02] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:33:02] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [08:33:02] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:33:03] PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:33:03] PROBLEM - puppet last run on ores2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:33:04] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:33:04] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:33:12] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:33:12] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:33:12] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:34:12] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [08:34:13] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [08:34:52] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received [08:36:12] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [08:36:22] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [08:36:32] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [08:36:32] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [08:36:36] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.075 second response time [08:36:37] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [08:36:42] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [08:36:42] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [08:36:42] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [08:37:02] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [08:37:02] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [08:37:02] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [08:37:02] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [08:37:12] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [08:43:33] RECOVERY - salt-minion processes on mw2202 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:48:14] !log uploaded nodepool 0.1.1+wmf8 to apt.wikimedia.org [08:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:33] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:52:13] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [08:52:42] RECOVERY - puppet last run on thumbor2002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:52:52] RECOVERY - puppet last run on mc2031 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:53:02] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [08:53:02] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [08:53:02] RECOVERY - puppet last run on ores2006 is OK: OK: 
Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:53:02] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:53:03] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:53:13] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [08:53:22] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:53:42] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:53:52] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:54:02] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:54:22] RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [08:54:32] RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [08:54:42] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [08:55:13] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [08:55:23] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:55:23] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:55:32] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [08:55:42] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:55:42] RECOVERY - puppet last run on mw2175 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:55:42] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [08:55:43] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:55:52] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:56:02] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [08:56:12] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [08:56:22] RECOVERY - puppet last run on elastic2013 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:56:32] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [08:56:33] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [08:56:42] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:57:02] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [08:57:22] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 
failures [08:57:33] RECOVERY - puppet last run on mw2181 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [08:57:42] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:57:52] RECOVERY - puppet last run on db2077 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [08:58:02] RECOVERY - puppet last run on oresrdb2002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [08:58:02] RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:00:21] (03PS2) 10Jcrespo: analytics-store: Ban Aria/MyISAM tables from WMF infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/365228 [09:01:18] !log restarting nodepool for upgrade 0.1.1-wmf7 -> 0.1.1-wmf8 [09:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:39] (03PS1) 10Ayounsi: Revert "Depool codfw for asw-c-codfw upgrade" [dns] - 10https://gerrit.wikimedia.org/r/366211 [09:01:47] (03PS1) 10Ayounsi: Revert "Route traffic around codfw for asw-c-codfw upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/366212 [09:01:54] (03PS2) 10Ayounsi: Revert "Depool codfw for asw-c-codfw upgrade" [dns] - 10https://gerrit.wikimedia.org/r/366211 [09:02:07] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T170538#3451557 (10Volans) [09:02:09] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3451559 (10Volans) [09:02:18] (03CR) 10Ayounsi: [C: 032] Revert "Depool codfw for asw-c-codfw upgrade" [dns] - 10https://gerrit.wikimedia.org/r/366211 (owner: 10Ayounsi) [09:02:20] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T170539#3451561 (10Volans) [09:02:22] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3370107 (10Volans) [09:02:29] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T169993#3451565 (10Volans) [09:02:31] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3370107 (10Volans) [09:03:38] (03CR) 10Ayounsi: [C: 032] Revert "Route traffic around codfw for asw-c-codfw upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/366212 (owner: 10Ayounsi) [09:03:46] !log codfw repooled in dns - T170380 [09:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:56] T170380: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380 [09:05:32] !log oblivian@neodymium conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=(restbase-async|citoid) [09:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:38] (03PS6) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [09:07:50] (03CR) 10Jcrespo: [C: 032] analytics-store: Ban Aria/MyISAM tables from WMF infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/365228 (owner: 10Jcrespo) [09:07:58] (03PS3) 10Jcrespo: analytics-store: Ban Aria/MyISAM tables from WMF infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/365228 [09:10:13] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380#3451571 (10ayounsi) 05Open>03Resolved Switch upgrade took ~9min and went as expected. 
Icinga paged about some svc.codfw services unreachable, followup task to be opened. [09:11:27] (03CR) 10jerkins-bot: [V: 04-1] systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [09:12:29] <_joe_> the rubocop is punishing my ruby, I guess [09:13:01] (03PS1) 10Elukey: Add new profiles for zookeeper firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) [09:15:04] (03PS2) 10Elukey: Add new profiles for zookeeper firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) [09:15:16] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "From what I see, the only thing that varies in the profiles is the srange." [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [09:16:18] lol I was about to ask if it made sense, I guess I got the answer [09:16:24] the alternative was hiera [09:16:25] !log finish up codfw cache_text/upload varnish/kernel upgrades [09:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:48] <_joe_> elukey: roles need to be specific, profile should not [09:16:55] <_joe_> so this is a textbook case :) [09:17:17] super, you clarified a doubt that I had [09:17:31] * elukey re-works the patch [09:17:36] thanks _joe_! [09:17:52] <_joe_> can you add that info to wherever in the puppet coding page you would've expected it? [09:18:28] (03PS1) 10Hashar: nodepool: pin python-jenkins to 0.4.12+ [puppet] - 10https://gerrit.wikimedia.org/r/366217 (https://phabricator.wikimedia.org/T144106) [09:18:33] it might be there, I'll review it and then if I don't find anything I can add this as example [09:18:47] 10Operations, 10Traffic, 10netops: Investigate lvs IP pages during codfw row C switch upgrade - https://phabricator.wikimedia.org/T171032#3451590 (10fgiunchedi) [09:20:06] (03CR) 10Hashar: "On labnodepool1001.eqiad.wmnet apt-cache policy reports:" [puppet] - 10https://gerrit.wikimedia.org/r/366217 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [09:21:46] (03CR) 10jerkins-bot: [V: 04-1] nodepool: pin python-jenkins to 0.4.12+ [puppet] - 10https://gerrit.wikimedia.org/r/366217 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [09:26:12] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:26:28] (03PS7) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [09:27:02] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:27:51] (03CR) 10jerkins-bot: [V: 04-1] systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [09:27:54] (03PS3) 10Elukey: Add a new profile for zookeeper firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) [09:28:03] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:28:12] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:28:14] ^ backups probably [09:28:30] our wednesday friendly reminder that they are running :D [09:28:52] (03CR) 10jerkins-bot: [V: 04-1] Add a new profile for zookeeper firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [09:28:53] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:29:02] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [09:29:26] yes jenkins I know [09:29:28] sorry [09:30:02] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366218 [09:30:05] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366218 [09:30:15] (03CR) 10Marostegui: [C: 04-2] "Wait for the last alter to be done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366218 (owner: 10Marostegui) [09:32:44] (03PS4) 10Elukey: Add a new profile for zookeeper firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) [09:32:44] (03PS8) 10Giuseppe Lavagetto: systemd: add defines to manage systemd units [puppet] - 10https://gerrit.wikimedia.org/r/365900 [09:37:24] (03CR) 10Elukey: "pcc in https://puppet-compiler.wmflabs.org/compiler02/7098/" [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [09:42:40] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3451633 (10Amire80) Yep, seems to work, although there's a longish lag until the link actually appears, but that's fine. Will anybody run the script... [09:43:37] (03PS2) 10Hashar: nodepool: pin python-jenkins to 0.4.12+ [puppet] - 10https://gerrit.wikimedia.org/r/366217 (https://phabricator.wikimedia.org/T144106) [09:43:58] (03PS1) 10Ema: prometheus: add job definition for nginx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/366220 [09:44:32] (03CR) 10Elukey: [C: 031] Deploy statsv with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/363579 (https://phabricator.wikimedia.org/T129139) (owner: 10Filippo Giunchedi) [09:46:30] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3447871 (10Sjoerddebruin) The lag might be related to the [[https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch | dispatch]] in the last few... [09:50:14] (03PS2) 10Filippo Giunchedi: Deploy statsv with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/363579 (https://phabricator.wikimedia.org/T129139) [09:51:51] (03PS4) 10Ema: [1/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365911 [09:51:59] (03CR) 10Ema: [V: 032 C: 032] [1/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365911 (owner: 10Ema) [09:55:12] 10Operations, 10User-fgiunchedi: Reduce Swift technical debt - https://phabricator.wikimedia.org/T162792#3451663 (10fgiunchedi) 05Open>03Resolved The goal itself is complete! 
Of the subtasks of this task though there's still swiftrepl left to tackle [09:55:26] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366218 (owner: 10Marostegui) [09:55:33] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/blacklist-linux44.conf],File[/etc/modprobe.d/blacklist-wmf.conf] [09:56:03] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modprobe.d/blacklist-wmf.conf] [09:56:24] (03CR) 10Filippo Giunchedi: [C: 032] Deploy statsv with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/363579 (https://phabricator.wikimedia.org/T129139) (owner: 10Filippo Giunchedi) [09:56:33] (03PS3) 10Filippo Giunchedi: Deploy statsv with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/363579 (https://phabricator.wikimedia.org/T129139) [09:56:33] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:56:42] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Deploy statsv with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/363579 (https://phabricator.wikimedia.org/T129139) (owner: 10Filippo Giunchedi) [09:56:59] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366218 (owner: 10Marostegui) [09:57:02] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:57:08] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1065" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366218 (owner: 10Marostegui) [09:58:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 - T166204 (duration: 00m 47s) [09:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:24] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [09:59:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366223 (https://phabricator.wikimedia.org/T166204) [10:00:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366223 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [10:01:07] !log filippo@tin Started deploy [statsv/statsv@0a86be8]: (no justification provided) [10:01:10] !log filippo@tin Finished deploy [statsv/statsv@0a86be8]: (no justification provided) (duration: 00m 03s) [10:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:52] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2022_v4, cp2022_v6 [10:02:47] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366223 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [10:02:52] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [10:02:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366223 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [10:04:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool 
db1051 - T166204 (duration: 00m 47s) [10:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:15] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [10:06:26] !log Deploy alter table on s1 - db1051 - T166204 [10:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:26] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3451712 (10fgiunchedi) p:05Triage>03Normal [10:09:28] !log ulsfo cache_text/upload: upgrade to varnish 4.1.7-1wm1 and reboot for kernel updates [10:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:50] (03PS5) 10Elukey: Add profile::zookeeper::firewall [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) [10:10:07] 10Operations, 10OfflineContentGenerator, 10Reading-Community-Engagement, 10Patch-For-Review, and 2 others: Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf - https://phabricator.wikimedia.org/T150874#3451714 (10fgiunchedi) p:05Triage>03Normal [10:13:13] (03PS2) 10Ema: prometheus: add job definition for nginx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/366220 [10:14:15] (03CR) 10Elukey: "After a chat with Moritz we decided to proceed with with more caution and split the change in two parts:" [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [10:20:52] 10Operations, 10ops-eqiad, 10OCG-General, 10Reading-Web-Backlog: ocg1001 is broken - https://phabricator.wikimedia.org/T170886#3451766 (10ovasileva) [10:21:39] 10Operations, 10ops-eqiad, 10OCG-General, 10Reading-Web-Backlog (Tracking): ocg1001 is broken - https://phabricator.wikimedia.org/T170886#3446253 (10ovasileva) [10:22:16] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: add job definition for nginx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/366220 (owner: 10Ema) [10:22:52] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [10:22:54] (03PS1) 10Urbanecm: Extend throttle rule per phabricator request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366225 (https://phabricator.wikimedia.org/T170844) [10:26:16] (03CR) 10Elukey: [C: 032] Add profile::zookeeper::firewall [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [10:26:22] (03PS6) 10Elukey: Add profile::zookeeper::firewall [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) [10:26:24] (03CR) 10Elukey: [V: 032 C: 032] Add profile::zookeeper::firewall [puppet] - 10https://gerrit.wikimedia.org/r/366216 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [10:28:15] (03PS4) 10ArielGlenn: write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) [10:28:23] (03CR) 10Muehlenhoff: [C: 031] "Looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/366217 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [10:28:51] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Reading-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on 
scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (10ovasileva) @GWicke, @mobrovac - currently all of the options we are serio... [10:30:38] 10Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#3451789 (10fgiunchedi) 05Open>03declined rcs being decom'd in {T170157} [10:30:40] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#3451793 (10fgiunchedi) [10:31:23] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10fgiunchedi) [10:32:46] (03CR) 10jerkins-bot: [V: 04-1] write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [10:34:14] (03CR) 10Alexandros Kosiaris: "I am removing my -2 on this once, based on the outcome of some discussions mentioned on the ops meeting on Monday. I will not be around in" [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [10:38:22] (03PS5) 10ArielGlenn: write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) [10:40:03] 10Operations, 10netops, 10Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3451811 (10fgiunchedi) a:03ayounsi [10:42:07] 10Operations, 10ChangeProp, 10Parsing-Team, 10Parsoid, and 6 others: Check concurrency/retry/timeout limits and syncronize those between services - https://phabricator.wikimedia.org/T152073#3451813 (10fgiunchedi) Is there anything left to be done here? [10:42:24] (03PS5) 10Jcrespo: [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) [10:42:30] (03PS1) 10Elukey: role::configcluster: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366228 (https://phabricator.wikimedia.org/T114815) [10:43:31] (03CR) 10jerkins-bot: [V: 04-1] [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [10:43:52] (03PS1) 10Alexandros Kosiaris: puppetdb: Bump Java Heap max size to 6GB [puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) [10:45:46] (03Abandoned) 10Muehlenhoff: Tighten access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/356548 (https://phabricator.wikimedia.org/T114815) (owner: 10Muehlenhoff) [10:45:54] (03CR) 10Muehlenhoff: [C: 031] role::configcluster: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366228 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [10:46:59] 10Operations, 10Traffic, 10Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#2823504 (10fgiunchedi) looks like this was fixed, @ema @elukey ? 
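The review back-and-forth at 09:15-09:17 and the profile merged at 10:26 come down to a single pattern: when the only thing that differs between near-identical profiles is the srange, keep one generic profile and let each role supply its own value through hiera. A minimal sketch of that shape, using hypothetical parameter names and only the well-known ZooKeeper client port; this is not the content of the actual change:

    # hypothetical illustration, not the merged profile::zookeeper::firewall
    class profile::zookeeper::firewall (
        $srange = hiera('profile::zookeeper::firewall::srange'),
    ) {
        ferm::service { 'zookeeper-client':
            proto  => 'tcp',
            port   => '2181',   # ZooKeeper client port
            srange => $srange,  # the one thing that varies per role
        }
    }

Each role that includes the profile (role::configcluster, for instance) then just sets profile::zookeeper::firewall::srange in its hiera data instead of carrying its own firewall profile.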
[10:49:55] (03PS6) 10Jcrespo: [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) [10:51:11] (03CR) 10jerkins-bot: [V: 04-1] [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [10:52:49] (03PS7) 10Jcrespo: [WIP]prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) [10:54:25] marostegui: hi :) [10:54:47] o/ [10:54:50] let's go ahead? [10:56:39] marostegui: should i start? [10:56:40] 10Operations, 10Cloud-Services: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402#3451840 (10fgiunchedi) [10:56:45] Steinsplitter: yes, please [10:56:55] If you can send me the meta page, so I can follow it progress, that'd be nice [10:57:41] ok :) https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/AlexanderRahm [10:58:04] Awesome, thanks [10:58:50] !log Global rename of user Moros - T170941 [10:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:02] T170941: Global rename of user Moros - https://phabricator.wikimedia.org/T170941 [10:59:09] 10Operations, 10Traffic: Fix lvs1001-6 storage - https://phabricator.wikimedia.org/T136737#3451852 (10fgiunchedi) [11:00:18] (03CR) 10Jcrespo: "@Filippo. What do you think of this solution?" [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [11:02:05] 10Operations, 10Puppet, 10Patch-For-Review: Add the puppet CA to the certification authorities trusted by our systems, on demand - https://phabricator.wikimedia.org/T114638#1701599 (10fgiunchedi) Anything left to do here? 
[11:18:48] (03PS1) 10Amire80: Configure wmgBabelMainCategory for the Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366233 [11:20:22] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4017_v4, cp4017_v6 [11:20:22] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4017_v4, cp4017_v6 [11:20:22] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4017_v4, cp4017_v6 [11:20:32] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4017_v4, cp4017_v6 [11:20:33] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4017_v4, cp4017_v6 [11:20:33] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4017_v4, cp4017_v6 [11:20:42] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4017_v4, cp4017_v6 [11:20:42] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4017_v4, cp4017_v6 [11:20:52] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4017_v4, cp4017_v6 [11:20:52] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4017_v4, cp4017_v6 [11:20:52] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4017_v4, cp4017_v6 [11:21:02] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4017_v4, cp4017_v6 [11:21:03] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp4017_v6 [11:21:03] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp4017_v6 [11:21:22] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [11:21:23] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [11:21:23] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [11:21:33] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [11:21:42] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [11:21:42] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [11:21:42] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [11:21:43] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [11:21:52] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [11:21:52] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [11:21:53] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [11:22:03] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [11:22:03] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [11:22:03] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [11:23:26] (03CR) 10Gergő Tisza: [C: 031] Configure wmgBabelMainCategory for the Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366233 (owner: 10Amire80) [11:24:42] cp4017 took a bit longer to reboot, excuse the icinga spam above ^ [11:25:27] (03CR) 10KartikMistry: [C: 031] Configure wmgBabelMainCategory for the Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366233 (owner: 10Amire80) [11:34:31] 10Operations, 10DBA: Global rename of user Moros - https://phabricator.wikimedia.org/T170941#3451982 (10Steinsplitter) 05Open>03Resolved [11:35:55] (03CR) 10ArielGlenn: [C: 032] fix construction of path for api calls [dumps] - 10https://gerrit.wikimedia.org/r/365564 (https://phabricator.wikimedia.org/T170741) (owner: 10ArielGlenn) [11:36:59] (03PS2) 10ArielGlenn: make sure page range explain query can never be for 0 pages [dumps] - 
10https://gerrit.wikimedia.org/r/365585 [11:37:08] !log kartik@tin Started deploy [cxserver/deploy@1029833]: Update cxserver to d28ad0c [11:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:58] !log rebooting acamar (DNS recursor) for kernel update [11:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:34] (03CR) 10ArielGlenn: [C: 032] make sure page range explain query can never be for 0 pages [dumps] - 10https://gerrit.wikimedia.org/r/365585 (owner: 10ArielGlenn) [11:39:02] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4015_v4, cp4015_v6 [11:39:03] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4015_v4, cp4015_v6 [11:39:03] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4015_v4, cp4015_v6 [11:39:03] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4015_v4, cp4015_v6 [11:39:12] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4015_v4, cp4015_v6 [11:39:13] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4015_v4, cp4015_v6 [11:39:22] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4015_v4, cp4015_v6 [11:39:23] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4015_v4, cp4015_v6 [11:39:32] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4015_v4, cp4015_v6 [11:39:33] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4015_v4, cp4015_v6 [11:39:33] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4015_v4, cp4015_v6 [11:39:46] ah cp hosts rebooting, nice [11:40:02] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 58 ESP OK [11:40:03] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 58 ESP OK [11:40:03] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 72 ESP OK [11:40:03] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 58 ESP OK [11:40:09] !log kartik@tin Finished deploy [cxserver/deploy@1029833]: Update cxserver to d28ad0c (duration: 03m 01s) [11:40:12] elukey: yeah, in ulsfo they occasionally take a bit longer to reboot than the icinga ipsec check timeouts unfortunately [11:40:12] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 72 ESP OK [11:40:13] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 58 ESP OK [11:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:23] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:40:23] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 58 ESP OK [11:40:23] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 58 ESP OK [11:40:32] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 72 ESP OK [11:40:33] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 72 ESP OK [11:40:33] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 72 ESP OK [11:40:46] PROBLEM - LVS HTTP IPv4 on eventbus.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:41:00] this just paged [11:41:02] ema: I trust you completely man [11:41:15] ouch [11:41:29] checking.. I was about to re-pool kafka2003 [11:41:34] ok [11:41:41] <_joe_> what's up? 
[11:41:42] <_joe_> ok [11:41:46] RECOVERY - LVS HTTP IPv4 on eventbus.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1488 bytes in 4.206 second response time [11:41:59] * volans checking icinga logs [11:43:05] * elukey check pybal logs on lvs2003 [11:43:13] might be related to the acamar reboot? [11:43:19] which is in codfw [11:43:23] elukey: ALERT: lvs2003;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka2002.codfw.wmnet because of too many down! [11:43:37] this was at 11:38:32 [11:44:23] then it went back ok at 11:39:22 [11:44:32] at 38:05 I can see [eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 30.000 s [11:44:38] then the same for 2002 [11:45:08] <_joe_> so a kafka slowness? [11:45:31] could be related to acamar reboot too [11:45:33] yep [11:45:38] checking on kafka200 [11:46:59] I don't see weird logs for kafka on 2001 [11:47:32] maybe eventbus [11:47:39] <_joe_> check the access logs [11:47:52] <_joe_> yeah I meant the service that is being checked by pybal [11:48:08] pybal-wise, we're now not doing DNS requests anymore essentially (confirmed on lvs2001 during acamar's reboot) [11:48:49] so I wouldn't think it's pybal getting confused by acamar's reboot but rather the service? [11:50:16] there are a lot of Node 2001 connection failed -- refreshing metadata in eventbus logs but occurring since days ago, so not related to this outage [11:51:27] so https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=kafka2001&refresh=1m&orgId=1 shows gaps in metrics [11:51:47] it smells like a host level issue [11:51:54] maybe because of acamar? [11:52:15] acamar went down at 11:38:01 and Linux started to boot at 11:40:16 with system services coming up a minute later or so [11:52:42] it needs about a minute to initialise its firmware for some reason... [11:52:42] it aligns with the outage [11:53:25] I am going to repool 2003 and rebalance main kafka codfw now [11:54:23] does it hardcode acamar somewhere? acamar was depooled from service=pdns_recursor during the reboot [11:55:30] not that I am aware of, but everything is possible :D [11:55:30] moritzm: and it still is depooled BTW [11:55:48] ema: yeah, I'm repooling it now [11:55:56] NTP clocks also have resynced mostly [11:59:17] the only explicit reference to acamar that I found was systemd/timesyncd.conf [11:59:43] that's fine, all the jessie/stretch servers have that [11:59:52] yeah [12:00:09] and multiple NTP servers are configured, so it's simply querying achernar instead [12:01:08] have we checked that acamar was effectively depooled on lvs2003? 
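The question that closes the exchange above, whether the depool actually reached the load balancer, is settled just below by depooling and repooling acamar while watching the IPVS state on the LVS host. Roughly, and assuming the confctl selector syntax as I remember it plus the stock ipvsadm listing flags (verify before use):

    # depool the recursor from the service
    sudo confctl select 'name=acamar.wikimedia.org' set/pooled=no
    # on the lvs host: acamar's address should disappear from the DNS VIP's realserver list
    sudo ipvsadm -L -n
    # repool once done
    sudo confctl select 'name=acamar.wikimedia.org' set/pooled=yes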
[12:01:35] pybal says it was, yeah [12:02:00] good [12:02:50] same hole in metrics for kafka200[123] [12:02:54] (server-board.json) [12:03:08] try the prometheus one too ;) [12:03:22] given that nothing is on fire anymore I'll go for lunch [12:04:11] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=kafka2001&var-datasource=codfw%20prometheus%2Fops [12:04:19] interesting, no hole [12:06:00] no idea [12:07:20] !log ariel@tin Started deploy [dumps/dumps@f95292e]: fix api call bug, page range query min pages [12:07:23] !log ariel@tin Finished deploy [dumps/dumps@f95292e]: fix api call bug, page range query min pages (duration: 00m 03s) [12:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:36] 10Operations, 10Traffic, 10Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#3452088 (10elukey) Logged on cp4006 and I can see `Jul 19 12:03:09 cp4006 varnishstatsd[2413]: Log overrun` in the syslog, so nope :( [12:08:42] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:08:59] let me double-check whether pybal does the right thing when depooling acamar (logs said it did, but how about ipvsadm?) [12:09:08] (03PS3) 10Phuedx: pagePreviews: Enable for anons/as pref on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365939 (https://phabricator.wikimedia.org/T167365) [12:09:17] !log ema@neodymium conftool action : set/pooled=no; selector: name=acamar.wikimedia.org [12:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:31] yeah, all good [12:10:31] !log ema@neodymium conftool action : set/pooled=yes; selector: name=acamar.wikimedia.org [12:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:17] elukey: I see some kafka200[123] alerts in icinga now [12:12:13] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:30] ema: don't see them in All unhandled, where are you checking? [12:12:45] elukey: they were brief [1/3], then gone [12:13:33] weirdly enough, they seemed to happen after my depool of acamar [12:14:09] depool at 12:09, and look at this: [12:14:17] Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 6.060 s [12:15:00] kafka2* use dns-rec-lb.codfw.wmnet in their resolv.conf, maybe the name resolution code in eventbus doesn't handle failing servers gracefully? [12:16:08] ema: nice! [12:16:32] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3366966 (10Ladsgroup) Why Wikidata is added, is there anything the team/volunteers/developers should do? [12:16:41] moritzm: this is probably it, I am going to open a phab task [12:18:14] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3452106 (10Urbanecm) @Ladsgroup Because of T170930 I think. 
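One plausible reading of the 6-second stall above (and of the 30-second ones during the reboot itself), and the reason the resolv.conf ordering comes up at 12:15, is the stock glibc resolver behaviour: when a configured nameserver stops answering, every lookup waits out the per-server timeout (5 seconds by default) before trying the next entry. An illustrative resolv.conf for one of the kafka hosts follows; the addresses are placeholders from the documentation range, and the options line shows a common mitigation rather than what is actually deployed:

    search codfw.wmnet
    nameserver 192.0.2.53   # dns-rec-lb.codfw.wmnet, the local recursor VIP, tried first
    nameserver 192.0.2.54   # second recursor, only consulted after the first times out
    options timeout:1 attempts:2 rotate   # shorter per-server timeout, spread queries across servers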
[12:19:22] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:21:23] 10Operations, 10Traffic, 10Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#3452120 (10ema) Yeah, all other daemons have been fixed but `varnishstatsd` seems to still be affected by this issue. [12:23:54] ema, elukey: let's withhold further recdns reboots until that's fixed? or we could proceed with one of the recdns servers in esams, if kafka/evenrbus servers primarily resolve via the dns-rec-lbs in their local DC, this might not fail [12:24:18] 10Operations, 10Analytics, 10EventBus: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#3452124 (10elukey) [12:24:46] moritzm: I think that esams is fine [12:24:56] just created the task [12:25:54] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3447871 (10Ladsgroup) I'm running it. [12:26:41] moritzm: or we could swap lines in resolv.conf on affected kafka boxes? [12:27:45] maybe let's first validate whether this doesn't trip in esams? maybe it also bails when changes are made to the secondary dns-rec loadbalancer [12:27:55] +1 [12:28:41] let, me start with depooling maerlant, maybe this triggers it already [12:29:11] moritzm: perhaps wait a sec before rebooting after depool [12:29:17] yep, will do [12:29:25] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: maerlant.wikimedia.org [12:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:05] ok so, confirmed that maerlant is gone from dns-rec-lb.esams [12:30:16] elukey: now let's see if kafka boxes still try to use it? [12:31:39] it doesn't look like [12:31:41] ema: I am lost sorry, where is dns-rec-lb.esams listed? [12:31:52] I mean in the resolv.conf of the kafkas [12:32:03] I thought there was only eqiad/codfw [12:32:27] ha, of course [12:32:33] ah, in fact [12:32:35] seems I misread that [12:32:42] * ema was dreaming of kafka30*s [12:33:21] proceeding with the maerlant reboot, then [12:33:38] moritzm: yep, there's no traffic towards maerlant's port 53 at all, go ahead [12:33:50] elukey: sorry for the confusion! [12:34:12] !log rebooting maerlant (DNS recursor) for kernel update [12:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:58] ema: all good! I wish we had kafka3*..ehm no that's not true :D [12:35:10] ETOOMANYKAFKASALREADY [12:35:14] :) [12:35:40] wait for the Internet of Kafkas, in 10 years it in every light bulb [12:35:51] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3452160 (10Reedy) >>! In T170930#3452154, @Ladsgroup wrote: > I'm running it. Running what? :P [12:36:46] moritzm: wow it's up already! [12:37:02] yep [12:37:17] hardware init is much faster for that host [12:37:23] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3452163 (10Ladsgroup) >>! In T170930#3451633, @Amire80 wrote: > Will anybody run the script that migrates all the existing classic links to Wikidata?... 
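The "no traffic towards maerlant's port 53" check at 12:33 is the usual last step before rebooting a recursor: once it is out of the VIP, confirm no client is still sending it queries directly. Something along these lines, run on the recursor itself (the interface name is a guess):

    # a quiet capture for a minute or two is the go-ahead signal
    sudo tcpdump -ni eth0 'port 53'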
[12:38:04] 10Operations, 10Traffic, 10Wikimedia-Planet, 10Patch-For-Review: mixed-content issues on planet.wikimedia.org - https://phabricator.wikimedia.org/T141480#2500163 (10fgiunchedi) I tried loading en.planet.wikimedia.org and got no mixed content alerts today, there were some osm linked that failed to load due... [12:38:10] I'll wait for NTP to resync, then nescio can follow [12:38:43] (03PS3) 10Ema: [2/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365912 [12:38:50] (03CR) 10Ema: [V: 032] [2/2] use kmod puppet module instead of File resources [puppet] - 10https://gerrit.wikimedia.org/r/365912 (owner: 10Ema) [12:40:06] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests: Add Dinka Wikipedia to Wikidata - https://phabricator.wikimedia.org/T170930#3452182 (10Ladsgroup) Done now: https://din.wikipedia.org/wiki/K%C3%ABc%C3%ABweek:Contributions/Dexbot We can resolve this I think. [12:40:12] !log running foreachwiki updateRestrictions.php T166184 [12:40:15] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3452184 (10Ladsgroup) [12:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:21] T166184: Run updateRestrictions.php on WMF wikis - https://phabricator.wikimedia.org/T166184 [12:40:40] 10Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, and 2 others: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3366966 (10Ladsgroup) The subtask is resolved. Removing #wikidata [12:40:50] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3452188 (10Ladsgroup) [12:43:50] (03CR) 10Elukey: "The change LGTM, but afaics there is no jmx configuration/port-opened, so it would be difficult to get the differences in performance befo" [puppet] - 10https://gerrit.wikimedia.org/r/366229 (https://phabricator.wikimedia.org/T170740) (owner: 10Alexandros Kosiaris) [12:45:33] 10Operations, 10Patch-For-Review: check status of multiple systemd units - https://phabricator.wikimedia.org/T134890#3452211 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I think the alerting on `systemd status` merged by @akosiaris is good enough for this [12:49:40] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: maerlant.wikimedia.org [12:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:42] (03PS8) 10Jcrespo: prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) [12:55:01] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: nescio.wikimedia.org [12:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:17] 10Operations, 10MediaWiki-API, 10Traffic, 10Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#2940086 (10fgiunchedi) This seems to WAI, only responses for anons are cached in varnish, anything left to do? 
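elukey's review comment at 12:43 on the puppetdb heap bump (the 6 GB change proposed at 10:43) is about being able to compare JVM behaviour before and after the change. A rough sketch of what that usually means, assuming the Debian-style /etc/default/puppetdb and stock JVM flags; the file location, port and the auth/ssl choices below are illustrative only:

    # /etc/default/puppetdb: raise the heap and expose JMX for a before/after comparison
    JAVA_ARGS="-Xmx6g -Dcom.sun.management.jmxremote.port=8192 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

    # or, without opening any port, sample GC/heap occupancy locally every 5 seconds
    sudo jstat -gcutil "$(pgrep -of puppetdb)" 5000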
[12:55:20] 10Operations, 10Traffic, 10Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#3452265 (10ema) Incidentally, while looking at entirely different stuff on esams recdns hosts, I've noticed that the vast majority of our DNS traffic there is due... [12:56:08] !log rebooting nescio (DNS recursor) for kernel update [12:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:16] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Patch-For-Review, 10Services (watching): rename cassandra cluster - https://phabricator.wikimedia.org/T112257#3452271 (10fgiunchedi) p:05Normal>03Low [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T1300). [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:19] I'm here [13:00:35] so many of them [13:00:55] zeljkof: I will take care of this swat :} [13:01:04] hashar, can you purge https://en.wikipedia.org/static/images/project-logos/enwikiquote.png please? [13:01:50] Urbanecm: yup done [13:02:03] hashar: great, just wanted to ask if the scap force is strong with you today, or should I swat ;) [13:02:10] hashar, thanks [13:02:30] (03CR) 10Hashar: [C: 032] Extend throttle rule per phabricator request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366225 (https://phabricator.wikimedia.org/T170844) (owner: 10Urbanecm) [13:03:40] (03Merged) 10jenkins-bot: Extend throttle rule per phabricator request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366225 (https://phabricator.wikimedia.org/T170844) (owner: 10Urbanecm) [13:04:48] 10Operations, 10Puppet, 10Patch-For-Review: Puppet certificate missing subjectAltName - https://phabricator.wikimedia.org/T158757#3046345 (10fgiunchedi) Looks like this is still the case for certs issued by puppet CA, the `dns_alt_names` lists steps to turn this on though https://docs.puppet.com/puppet/lates... [13:05:09] !log hashar@tin Synchronized wmf-config/throttle.php: Extend throttle rule - T170844 (duration: 00m 48s) [13:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:19] T170844: Request a temporary lift of the account creation cap on a specific IP for an outreach event on 2017-07-20 - https://phabricator.wikimedia.org/T170844 [13:05:53] (03CR) 10Hashar: [C: 032] Update wikiversity's logos to 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365966 (https://phabricator.wikimedia.org/T160491) (owner: 10Urbanecm) [13:06:05] hashar: i've got a beta cluster-only change to +2 after the swat [13:06:09] could you ping me when you're done [13:06:11] ? [13:06:13] (03PS3) 10Hashar: Update wikiversity's logos to 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365966 (https://phabricator.wikimedia.org/T160491) (owner: 10Urbanecm) [13:06:15] (03CR) 10jenkins-bot: Extend throttle rule per phabricator request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366225 (https://phabricator.wikimedia.org/T170844) (owner: 10Urbanecm) [13:06:20] phuedx: which one ? 
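The `Synchronized wmf-config/throttle.php` SAL entry above comes from syncing a single merged config file out from the deployment host during SWAT. A minimal sketch of that step follows, assuming the usual scap tooling; the staging path and the exact git update procedure on tin are assumptions, so treat the details as illustrative.
```
# On the deployment host, pull the merged mediawiki-config change into the
# staging checkout (exact update workflow is an assumption).
cd /srv/mediawiki-staging
git pull

# Sync only the touched file to the app servers; the message is what ends up
# in the Server Admin Log.
scap sync-file wmf-config/throttle.php 'Extend throttle rule - T170844'
```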
[13:06:22] I am going to +2 it right now [13:06:29] https://gerrit.wikimedia.org/r/#/c/365939/ [13:06:31] oh, ta [13:06:34] hashar: ^^ [13:06:57] (03PS4) 10Hashar: pagePreviews: Enable for anons/as pref on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365939 (https://phabricator.wikimedia.org/T167365) (owner: 10Phuedx) [13:07:06] (03CR) 10Hashar: [C: 032] pagePreviews: Enable for anons/as pref on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365939 (https://phabricator.wikimedia.org/T167365) (owner: 10Phuedx) [13:08:03] (03Merged) 10jenkins-bot: Update wikiversity's logos to 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365966 (https://phabricator.wikimedia.org/T160491) (owner: 10Urbanecm) [13:08:18] (03Merged) 10jenkins-bot: pagePreviews: Enable for anons/as pref on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365939 (https://phabricator.wikimedia.org/T167365) (owner: 10Phuedx) [13:08:35] (03CR) 10jenkins-bot: Update wikiversity's logos to 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365966 (https://phabricator.wikimedia.org/T160491) (owner: 10Urbanecm) [13:09:01] Urbanecm: and updating the wikiversity logos [13:09:08] phuedx: +2ed /merged :) [13:09:08] Ack [13:09:40] !log hashar@tin Synchronized static/images/project-logos: Update wikiversity logos to 2017 - T160491 (duration: 00m 48s) [13:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:52] T160491: Update Wikiversity logos - https://phabricator.wikimedia.org/T160491 [13:10:47] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Update wikiversity logos to 2017 - T160491 (duration: 00m 46s) [13:10:56] (03CR) 10Hashar: [C: 032] Change logo on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366058 (https://phabricator.wikimedia.org/T170984) (owner: 10Urbanecm) [13:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:56] hashar: thanks! [13:11:43] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: nescio.wikimedia.org [13:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:53] (03Merged) 10jenkins-bot: Change logo on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366058 (https://phabricator.wikimedia.org/T170984) (owner: 10Urbanecm) [13:12:08] phuedx: the jenkins job that updates beta seems to be borked. 
Will fix it after swat [13:16:29] Urbanecm: and the nlwikinews logo is landing [13:16:43] ack [13:16:56] (03CR) 10jenkins-bot: pagePreviews: Enable for anons/as pref on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365939 (https://phabricator.wikimedia.org/T167365) (owner: 10Phuedx) [13:17:04] !log hashar@tin Synchronized static/images/project-logos/nlwikinews.png: Change logo on nl.wikinews - T170984 (duration: 00m 47s) [13:17:11] phuedx: looks like the job that updates beta cluster is running now :) [13:17:14] (03PS1) 10Filippo Giunchedi: prometheus: check for vhtcp.stats file [puppet] - 10https://gerrit.wikimedia.org/r/366251 (https://phabricator.wikimedia.org/T157353) [13:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:14] T170984: Change logo on nl.wikinews - https://phabricator.wikimedia.org/T170984 [13:17:28] (03PS1) 10Jcrespo: mariadb: Disable mariadb main instance starting for multisource [puppet] - 10https://gerrit.wikimedia.org/r/366252 (https://phabricator.wikimedia.org/T169514) [13:17:44] (03CR) 10Hashar: [C: 032] Change timezone on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366077 (https://phabricator.wikimedia.org/T170985) (owner: 10Urbanecm) [13:17:49] (03PS2) 10Hashar: Change timezone on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366077 (https://phabricator.wikimedia.org/T170985) (owner: 10Urbanecm) [13:18:08] (03CR) 10jerkins-bot: [V: 04-1] prometheus: check for vhtcp.stats file [puppet] - 10https://gerrit.wikimedia.org/r/366251 (https://phabricator.wikimedia.org/T157353) (owner: 10Filippo Giunchedi) [13:18:15] (03PS2) 10Filippo Giunchedi: prometheus: check for vhtcp.stats file [puppet] - 10https://gerrit.wikimedia.org/r/366251 (https://phabricator.wikimedia.org/T157353) [13:18:36] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Disable mariadb main instance starting for multisource [puppet] - 10https://gerrit.wikimedia.org/r/366252 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [13:18:41] (03PS3) 10Muehlenhoff: nodepool: pin python-jenkins to 0.4.12+ [puppet] - 10https://gerrit.wikimedia.org/r/366217 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [13:18:46] (03CR) 10jenkins-bot: Change logo on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366058 (https://phabricator.wikimedia.org/T170984) (owner: 10Urbanecm) [13:22:27] 10Operations, 10Traffic: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976#3452356 (10ema) Icinga [[ https://www.icinga.com/docs/icinga1/latest/en/extcommands2.html|external commands]] include `SCHEDULE_SVC_DOWNTIME`, which seems handy. We could perh... 
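ema's T148976 comment above points at Icinga's external command interface: scheduling downtime for a check on a depooled host comes down to writing one line into the command pipe. A hedged sketch follows; the SCHEDULE_SVC_DOWNTIME field order is the documented Nagios/Icinga 1 format, but the command-file path and the "IPsec" service description are assumptions.
```
# Schedule two hours of fixed downtime for a service on a depooled cache host.
# Field order: SCHEDULE_SVC_DOWNTIME;<host>;<service>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
now=$(date +%s)
end=$((now + 7200))
printf '[%d] SCHEDULE_SVC_DOWNTIME;cp4008;IPsec;%d;%d;1;0;7200;ops;host depooled\n' \
  "$now" "$now" "$end" > /var/lib/icinga/rw/icinga.cmd   # command-file path is an assumption
```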
[13:22:46] (03CR) 10Hashar: [C: 032] Change timezone on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366077 (https://phabricator.wikimedia.org/T170985) (owner: 10Urbanecm) [13:23:32] (03CR) 10Muehlenhoff: [C: 032] nodepool: pin python-jenkins to 0.4.12+ [puppet] - 10https://gerrit.wikimedia.org/r/366217 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [13:24:29] !log Optimize EditConflict_8860941_15423246 and Echo_7731316 on dbstore1002 - T168303 [13:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:39] (03Merged) 10jenkins-bot: Change timezone on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366077 (https://phabricator.wikimedia.org/T170985) (owner: 10Urbanecm) [13:24:40] T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303 [13:24:53] 10Operations, 10Analytics, 10EventBus, 10User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#3452361 (10elukey) [13:25:35] Urbanecm: syncing last (tz change for nlwikinews) [13:25:42] then I guess I am going to purge all logos [13:26:06] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Change timezone on nl.wikinews to Europe/Berlin - T170985 (duration: 00m 44s) [13:26:11] (03CR) 10jenkins-bot: Change timezone on nl.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366077 (https://phabricator.wikimedia.org/T170985) (owner: 10Urbanecm) [13:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:16] T170985: Change timezone on nl.wikinews - https://phabricator.wikimedia.org/T170985 [13:26:34] jynus: is mysqld_exporter_multiinstance.pp used at all in https://gerrit.wikimedia.org/r/#/c/364396 ? [13:27:29] if it is not, it may be a mistake [13:29:02] through the role::mysqld_exporter_instance.pp which is not a role, but a resouce? [13:29:20] !log Purged all 1685 project-logos ( find static/images/project-logos -maxdepth 1 -type f| sed -e 's%^%https://en.wikipedia.org/%'|mwscript purgeList.php --wiki=enwiki ) [13:29:21] Urbanecm: I have purged all the logos [13:29:26] hashar, thanks [13:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:54] jynus: it should but it isn't? I see prometheus::mysqld_exporter { $title: [13:31:01] (03PS3) 10Ema: prometheus: add job definition for nginx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/366220 [13:31:06] (03CR) 10Ema: [V: 032 C: 032] prometheus: add job definition for nginx_exporter [puppet] - 10https://gerrit.wikimedia.org/r/366220 (owner: 10Ema) [13:31:14] godog: ah, I see it now [13:31:38] (03CR) 10Elukey: [C: 032] role::configcluster: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366228 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [13:31:44] s/mysqld_exporter/mysqld_exporter_multiinstance/ [13:31:46] (03PS2) 10Elukey: role::configcluster: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366228 (https://phabricator.wikimedia.org/T114815) [13:31:48] (03CR) 10Elukey: [V: 032 C: 032] role::configcluster: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366228 (https://phabricator.wikimedia.org/T114815) (owner: 10Elukey) [13:32:51] jynus: ah ok, but anyway the existing prometheus::mysqld_exporter could be extended with a different listening port and the rest is practically the same? 
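hashar's logo purge one-liner above is worth spelling out; the commented version below is the same pipeline, with the working directory on the deployment/maintenance host being an assumption.
```
# Build one canonical URL per logo file and feed the list to purgeList.php,
# which reads URLs on stdin and issues a cache purge for each of them.
cd /srv/mediawiki-staging   # assumed working directory
find static/images/project-logos -maxdepth 1 -type f |
  sed -e 's%^%https://en.wikipedia.org/%' |
  mwscript purgeList.php --wiki=enwiki
```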
[13:32:57] !log Limit the access to the conf* zookeeper ports via ferm rules - https://gerrit.wikimedia.org/r/366228 [13:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:26] moritzm: ready to run puppet on conf2001 when you are good [13:33:54] adding the logging rules [13:34:24] godog, the plan is to delete the original [13:34:34] elukey: done [13:34:41] but this way, I can test it on a single host [13:34:43] Urbanecm: so I think it is all good. Thank you! [13:34:48] !log European SWAT completed [13:34:52] and then remove the old one, rename it to the original name [13:34:53] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017): Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3452402 (10Johan) @BBlack Sorry for late reply, been OoO. Difficult to say, given that this isn... [13:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:12] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017), 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3452403 (10Johan) a:03Johan [13:35:23] moritzm: all right! Running puppet on conf2001 [13:35:35] there is lots of pending refactoring because of those horrible role::mysql::group [13:35:50] which now must accept multiple groups on the same node [13:36:20] hashar, you're welcome [13:36:21] godog: I wanted to know if it was in the right direction [13:36:32] moritzm: ferm just restarted [13:36:38] I am not sure of the high level stuff [13:36:48] the role as a resource seems strange to me [13:37:15] jynus: it is probably a profile, not a role [13:37:16] there is no other role as a resource, other roles have multiple instances but the number is fixed [13:37:27] this role or profile or whatever [13:37:31] will be applied n times [13:37:50] I would like to apply it automatically on instance, but my instances are modules [13:38:03] so I may have to create a role for mysql instances [13:38:04] elukey: all looking good so far, nothing dropped by now [13:38:15] and put there the mysql and the prometheus stuff [13:38:23] moritzm: \o/ [13:38:33] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [13:38:38] ouch [13:38:39] godog: just feel free to comment anything you want [13:38:48] checking an1003 [13:38:49] and I will see what I can do [13:39:11] I will start by fixing the mistakes [13:39:30] jynus: ok I will! [13:39:35] thank you [13:39:43] this comment was already very useful [13:40:02] I am a bit lost about the model [13:40:19] other services had multi-instance, but this is a bit different [13:40:41] in what way? [13:41:10] java.lang.OutOfMemoryError: Java heap space [13:41:34] elukey: mhh I think I might have broken it, was running a webrequest query earlier [13:41:38] yeah :D [13:41:40] godog: the top level manifest does have a fixed and small number of instances [13:41:56] here the top-level manifest has an arbitrary number of instances [13:42:01] up to 7 [13:42:22] there will be classes with 7, others with 2 [13:42:31] then I saw something strange [13:42:43] I was going to create a prometheus::mysql::common [13:42:48] as a one-time include [13:43:06] but then I saw /var/lib/prometheus not being created anywhere [13:43:24] is it created by the package?
by the user creation? [13:43:36] other manifests use it but I do not see it defined anywhere [13:43:48] by the package yeah [13:44:00] ok, then it's not needed to be created in puppet [13:44:13] which makes a common class not needed [13:44:30] plus filesystem permissions are less problematic [13:44:33] RECOVERY - Hive Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 [13:44:34] as I intend to delete passwords [13:44:47] so take this as a first step [13:45:13] but please tell me if the idea is ok, as I was a bit lost as I told you [13:45:42] that is why I created a parallel file structure [13:45:43] !log restart hive-server on analytics1003 - Java OOM issue due to a huge query [13:45:48] ya hey just saw that [13:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:59] huge query caused hive server to crash? [13:45:59] being able to break things without affecting existing monitoring [13:45:59] huh. [13:46:17] !log Compress database rowiki on dbstore1002 - T168303 [13:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:27] T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303 [13:47:30] jynus: yeah I see what you mean, I'll comment on the review but I think adapting the existing define would be easy [13:53:37] 10Operations, 10ops-eqiad, 10OCG-General, 10Reading-Web-Backlog (Tracking): ocg1001 is broken - https://phabricator.wikimedia.org/T170886#3446253 (10Joe) Please note that if no one ran the script that removes entries corresponding to ocg1001 from the cache, we're still serving failing requests as mediawiki... [13:53:37] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=pdf,name=ocg1001.eqiad.wmnet [13:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:50] (03CR) 10Filippo Giunchedi: "The exporter part I think can be done by changing the existing class into a define and use 'default' for the instance name, plus using bas" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [13:55:18] <_joe_> !log running clear-host-cache.js for ocg1001 decommission T170886 [13:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:27] T170886: ocg1001 is broken - https://phabricator.wikimedia.org/T170886 [13:56:27] elukey: I ran the query again with month=6 fyi [13:56:47] godog: not sure if it was you, still checking :) [14:01:28] 10Operations, 10Commons, 10media-storage: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#2179992 (10fgiunchedi) @Steinsplitter have you seen this happening recently? [14:01:36] (03PS1) 10Elukey: profile::druid::zookeper: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366256 [14:02:18] 10Operations, 10Commons, 10media-storage: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#3452492 (10Steinsplitter) >>! In T131832#3452489, @fgiunchedi wrote: > @Steinsplitter have you seen this happening recently? Haven't done undeletions for a while now,...
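Once the exporter is converted to one instance per mysqld (the define jynus and godog are discussing above), a quick way to sanity-check a host is to poke each instance's metrics endpoint. Everything below is illustrative: the per-instance unit naming scheme and the port numbers are assumptions, not something settled in the discussion.
```
# List whatever per-instance exporter units ended up on the host
# (the unit name pattern is a guess at how the define might be instantiated).
systemctl list-units 'prometheus-mysqld-exporter*' --no-legend

# Probe a few hypothetical per-instance ports and confirm each one answers.
for port in 9104 13306 13307; do
  if curl -sf "http://localhost:${port}/metrics" | grep -q '^mysql_up'; then
    echo "exporter on ${port}: OK"
  else
    echo "exporter on ${port}: no response"
  fi
done
```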
[14:02:26] (03PS2) 10Elukey: profile::ac::druid::zookeper: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366256 [14:02:45] 10Operations, 10ops-eqiad, 10OCG-General, 10Reading-Web-Backlog (Tracking): ocg1001 is broken - https://phabricator.wikimedia.org/T170886#3452493 (10Joe) I ran the script and it worked fine, but we also need to add a redis slave on ocg1003, as right now we lost the replica of the redis master which is on o... [14:04:25] 10Operations, 10Commons, 10media-storage: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#3452496 (10fgiunchedi) p:05High>03Normal no problem, maybe @Poyekhali ? [14:05:39] 10Operations, 10MediaWiki-API, 10Traffic, 10Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3452500 (10Tgr) >>! In T155314#3452263, @fgiunchedi wrote: > This seems to WAI, only responses for anons are cached in varnish, anything left... [14:06:05] 10Operations: Upgrade mc1* cluster to Linux 4.4 - https://phabricator.wikimedia.org/T143695#2576158 (10fgiunchedi) Indeed, looks like we have 12x mc machines in eqiad still with 3.19 [14:07:00] (03CR) 10Muehlenhoff: [C: 031] profile::ac::druid::zookeper: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366256 (owner: 10Elukey) [14:07:42] 10Operations, 10MediaWiki-API, 10Traffic, 10Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3452507 (10Tgr) [14:07:54] (03CR) 10Elukey: [C: 032] profile::ac::druid::zookeper: tighten the access to zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/366256 (owner: 10Elukey) [14:08:55] 10Operations: Upgrade mc1* cluster to Linux 4.4 - https://phabricator.wikimedia.org/T143695#3452510 (10MoritzMuehlenhoff) These are all waiting to be decommisioned, see T164341 [14:11:24] 10Operations, 10MediaWiki-API, 10Traffic, 10Patch-For-Review: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3452522 (10Tgr) This is not a huge deal right now as clients have to explicitly ask for API responses to be cached in Varnish (although some t... [14:18:01] 10Operations, 10Analytics-Kanban, 10Wikimedia-Stream, 10hardware-requests, 10Patch-For-Review: decommission rcs100[12] - https://phabricator.wikimedia.org/T170157#3452538 (10Ottomata) > Because of in CommonSettings.. Aye yai yai. @Reedy very sorry about that, thanks for cleaning up my mess. [14:18:08] 10Operations: Upgrade mc1* cluster to Linux 4.4 - https://phabricator.wikimedia.org/T143695#3452539 (10fgiunchedi) 05Open>03Invalid Hosts need decom anyways, resolving, thanks @Muehlenhoff ! 
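With the conf* and druid zookeeper ports now restricted by ferm, a quick functional check is that an allowed client can still reach the standard ZooKeeper client port (2181) while an arbitrary host cannot. The `ruok` four-letter command is stock ZooKeeper; the FQDN below is inferred from the host naming convention in the log, so treat it as an assumption.
```
# From a host that the ferm rules should allow (e.g. a zookeeper client):
echo ruok | nc -w 2 conf2001.codfw.wmnet 2181   # a healthy server answers "imok"

# From a host that should now be blocked, the connection should fail or time out:
nc -z -w 2 conf2001.codfw.wmnet 2181 \
  && echo "still reachable (unexpected)" \
  || echo "blocked as intended"
```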
[14:25:59] (03PS3) 10Filippo Giunchedi: prometheus: check for vhtcp.stats file [puppet] - 10https://gerrit.wikimedia.org/r/366251 (https://phabricator.wikimedia.org/T157353) [14:28:49] 10Operations, 10Cloud-Services, 10Cloud-VPS, 10Wikimedia-Incident: labservices1001 down, suspected overheating - https://phabricator.wikimedia.org/T152340#3452581 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I don't think we've seen this reoccuring [14:31:48] !log installing imagemagick security updates [14:31:51] 10Operations, 10DBA: Set up backup strategy for es clusters - https://phabricator.wikimedia.org/T79922#3452586 (10fgiunchedi) [14:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:10] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/366251 (https://phabricator.wikimedia.org/T157353) (owner: 10Filippo Giunchedi) [14:34:10] (03PS4) 10Filippo Giunchedi: prometheus: check for vhtcp.stats file [puppet] - 10https://gerrit.wikimedia.org/r/366251 (https://phabricator.wikimedia.org/T157353) [14:36:37] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: check for vhtcp.stats file [puppet] - 10https://gerrit.wikimedia.org/r/366251 (https://phabricator.wikimedia.org/T157353) (owner: 10Filippo Giunchedi) [14:37:20] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#3452610 (10fgiunchedi) [14:37:22] 10Operations, 10Traffic, 10Patch-For-Review: prometheus-vhtcpd-stats cronspamming if vhtcpd is not running yet - https://phabricator.wikimedia.org/T157353#3452607 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi We're checking for file presence too now [14:42:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I did test this patch and hiera will break if we change the names of the classes. So I'm inclined to revert the name changes and configure" [puppet] - 10https://gerrit.wikimedia.org/r/359447 (owner: 10Faidon Liambotis) [14:44:36] !log Restarting Jenkins [14:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:35] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Reading-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3452630 (10GWicke) @ovasileva, thanks for the update. Anything that uses a generic b... [14:46:57] ciao mobrovac, zk firewall rules in place, let me know if you see any issue [14:47:15] oh cool elukey! [14:47:44] i'll monitor and let you know if any issues pop up! 
[14:48:47] 10Operations, 10monitoring, 10netops: internal network packet loss alerting - https://phabricator.wikimedia.org/T83196#3452641 (10fgiunchedi) +#netops as this sounds similar to {T169860} [14:48:48] (03PS4) 10Giuseppe Lavagetto: Add discovery DNS entry for service recommendation-api [dns] - 10https://gerrit.wikimedia.org/r/364458 (https://phabricator.wikimedia.org/T165760) [14:49:31] 10Operations, 10Mail: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T83120#3452646 (10fgiunchedi) [14:51:14] 10Operations: Extend Wikimedia APT repository with more pinning alternatives - https://phabricator.wikimedia.org/T78948#3452652 (10fgiunchedi) 05Open>03declined Tracked in {T158583} [14:52:29] 10Operations: track package updates available for apt.wikimedia.org - https://phabricator.wikimedia.org/T84235#3452656 (10fgiunchedi) 05Open>03declined See also {T158583} [14:53:43] 10Operations, 10monitoring, 10netops: internal network packet loss alerting - https://phabricator.wikimedia.org/T83196#3452661 (10ayounsi) 05Open>03Resolved a:03ayounsi It's fine to close. We already have better latency/loss monitoring than I guess in 2013/2015. T169860 might improve that even more. [14:58:36] 10Operations, 10Domains, 10Traffic, 10Patch-For-Review: wikiknihy.cz - transfer to Wikimedia Czech Republic? - https://phabricator.wikimedia.org/T127573#2047253 (10fgiunchedi) I've ran again a hive query to get some idea about this domain, 670 hits in june ``` 0: jdbc:hive2://analytics1003.eqiad.wmnet:100... [15:03:22] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2033937 [15:05:33] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [15:05:33] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [15:05:51] (03CR) 10Giuseppe Lavagetto: [C: 032] Add discovery DNS entry for service recommendation-api [dns] - 10https://gerrit.wikimedia.org/r/364458 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [15:05:52] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [15:06:12] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [15:17:14] (03CR) 10Umherirrender: "Answer of legal is documented on phab: "we clear this change as well"" [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: 10Umherirrender) [15:17:24] (03PS5) 10Umherirrender: Add ar_content_format and ar_content_model to labs views [puppet] - 
10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) [15:19:10] (03CR) 10Jcrespo: "> The exporter part I think can be done by changing the existing class into a define and use 'default' for the instance name, plus using b" [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [15:21:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The solution is correct in line of principle, however the current patch would break existing installations." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/365053 (owner: 10Andrew Bogott) [15:28:25] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [15:28:34] (03PS2) 10Filippo Giunchedi: kafkatee: send 4xx to logstash as well [puppet] - 10https://gerrit.wikimedia.org/r/365247 [15:28:55] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [15:28:55] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [15:29:15] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [15:30:08] 10Operations, 10media-storage: upload.wikimedia.org needs a Wikimedia 404 error page - https://phabricator.wikimedia.org/T37053#3452760 (10fgiunchedi) [15:32:12] 10Operations, 10Mail: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T83120#3452772 (10Dzahn) I don't know at all , all i did almost 4 years ago was import a Bugzilla ticket, and that is meanwhile imported to T58414 So this is a duplicate. [15:32:50] 10Operations, 10Mail: Get mail relay out of Yahoo! blacklist: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T58414#616526 (10Dzahn) [15:32:52] 10Operations, 10Mail: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T83120#3452779 (10Dzahn) [15:34:00] (03CR) 10Filippo Giunchedi: [C: 032] kafkatee: send 4xx to logstash as well [puppet] - 10https://gerrit.wikimedia.org/r/365247 (owner: 10Filippo Giunchedi) [15:36:39] (03PS9) 10Jcrespo: prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) [15:37:39] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [15:38:01] 10Operations, 10Traffic, 10Wikimedia-Planet, 10Patch-For-Review: mixed-content issues on planet.wikimedia.org - https://phabricator.wikimedia.org/T141480#3452829 (10Dzahn) Yes, not all planet feeds use https yet. 
[15:39:59] (03PS10) 10Jcrespo: prometheus: Convert mysqld-exporter into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) [15:41:40] (03CR) 10Jcrespo: "So, to clarify, this is a testable first version, then we have to fix the mariadb::group stuff and automatic resource collection, and then" [puppet] - 10https://gerrit.wikimedia.org/r/364396 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo) [15:51:18] 10Operations, 10discovery-system: Add python-etcd the ability to discover the etcd hosts using SRV records - https://phabricator.wikimedia.org/T102109#3452874 (10Joe) 05Open>03Resolved a:03Joe [16:01:15] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2081872 [16:04:39] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1196.eqiad.wmnet [16:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:56] !log mw1196 has hardware failure and is being decommissioned [16:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:13] 10Operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#1754262 (10fgiunchedi) FWIW 5xx and 4xx are available directly from logstash now `type:webrequest` [16:10:48] !log robh@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1196.eqiad.wmnet [16:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:38] 10Operations, 10Traffic, 10Zero: Security: Is it safe to enable Zero spoofing - https://phabricator.wikimedia.org/T120631#3452969 (10fgiunchedi) [16:16:18] !log Compressing innodb on dbstore1002 for the following wikis: viwiki ukwiki kowiki huwiki hewiki frwiktionary fawiki eswiki cawiki arwiki - T168303 [16:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:28] T168303: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303 [16:20:01] (03PS1) 10RobH: decommission of mw1196 [puppet] - 10https://gerrit.wikimedia.org/r/366284 (https://phabricator.wikimedia.org/T170441) [16:21:32] whenever i turn down a switch port, i have a slight fear the port isn't labeled right and I'll have turned down a vital server. [16:21:39] (valid fears that have happened in the past) [16:21:57] could anybody (robh, godog?) check what is the status of wdqs pool? I suspect 1002 is not in the pool (it should be 1001 and 1002 in the pool and 1003 out for maintenance)? [16:22:38] {"wdqs1002.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"} [16:22:50] its set inactive it seems [16:22:54] ill set it active and it should go back in [16:23:00] robh hmm could you reset it? [16:23:01] thanks [16:23:35] !log robh@puppetmaster1001 conftool action : set/pooled=active; selector: name=wdqs1002.eqiad.wmnet [16:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:49] hrmm [16:23:50] {"wdqs1002.eqiad.wmnet": {"pooled": "inactive", "weight": 10}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"} [16:23:50] when i get [16:23:52] wtf [16:23:58] this is AFTER i set to pooled active [16:24:19] hmm... I checked and maintenance is off... 
[16:24:26] server seems to be up and running [16:24:31] let me see if pybal is happy with it [16:24:34] at least from the inside [16:26:45] ok, well, lvs1003 [16:26:47] Jul 19 16:25:49 lvs1003 pybal[13046]: [wdqs_80 ProxyFetch] WARN: wdqs1003.eqiad.wmnet (enabled/partially up/not pooled): Fetch failed, 1.881 s [16:27:00] so 1003 is in the pool [16:27:00] 1003 is out for maintenance, that's ok [16:27:06] oh what? [16:27:08] its depooled automatically [16:27:14] but its set to be active, so pybal is trying [16:27:17] it shouldn't be in the ppol [16:27:18] it should be set inactive i assume [16:27:22] yeah [16:27:23] so im going to set it inactive now [16:27:36] !log robh@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=wdqs1003.eqiad.wmnet [16:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:53] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1003.eqiad.wmnet [16:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:20] SMalyshev: im not 100% sure if it has to be both pooled=no and inactive, so i set both. [16:28:31] no, 1001 should be active [16:28:34] and 1002 [16:28:38] but not 1003 [16:28:53] indeed, 1003 is now inactive [16:28:53] ah, ok, that's ok [16:29:00] but i tried to set 1002 to set/pooled=active [16:29:04] and that doesnt remove the inactive flag it seems [16:29:12] because that would be sensible? [16:30:20] weird why it doesn't accept 1002? [16:31:10] im not 100% my syntax is correct [16:31:16] i know its set/pooled/=inactive [16:31:25] sorry set/pooled=inactive [16:31:31] i assume set/pooled=active is the reverse [16:31:47] Jul 19 16:31:29 lvs1003 pybal[13046]: [logstash-syslog-tcp_10514] INFO: Server logstash1002.eqiad.wmnet (enabled/partially up/not pooled) is up [16:31:57] pybal sees it as partially there [16:32:15] Jul 19 16:32:10 lvs1003 pybal[13046]: [wdqs_80 ProxyFetch] WARN: wdqs1003.eqiad.wmnet (disabled/partially up/not pooled): Fetch failed, 0.004 s [16:32:22] though it keeps testing 1003 and i dunno why [16:32:28] i told it to be inactive [16:32:37] wiki says confctl select name=foo.example.net set/pooled=yes [16:32:49] maybe that's the syntax? [16:32:51] Jul 19 16:32:41 lvs1003 pybal[13046]: [logstash-json-tcp_11514 IdleConnection] WARN: logstash1001.eqiad.wmnet (enabled/down/not pooled): Connection to 10.64.0.122:11514 failed. [16:32:55] 1001 is not responding either [16:33:08] Jul 19 16:32:58 lvs1003 pybal[13046]: [logstash-json-tcp_11514 IdleConnection] WARN: logstash1002.eqiad.wmnet (enabled/down/pooled): Connection to 10.64.32.137:11514 failed. [16:33:14] its trying to send to both 1001 and 1002 and failing [16:33:21] SMalyshev: were those recently reloaded and not fully back online perhaps? [16:33:29] it seems they are refusing pybal [16:33:34] robh: no, 1001 is serving traffic right now [16:33:44] 1002 should be fine too, I don't see any issue with it [16:33:52] not according to pybal [16:34:01] hmm that's weird [16:34:06] Jul 19 16:33:47 lvs1003 pybal[13046]: [logstash-json-tcp_11514 IdleConnection] WARN: logstash1001.eqiad.wmnet (enabled/down/not pooled): Connection to 10.64.0.122:11514 failed. [16:34:12] pybal has 1001 and 1002 as down [16:34:20] logstash? 
[16:34:25] oh sorry [16:34:26] fuck [16:34:40] haha [16:34:45] my grep is fubar, and i need to fix [16:35:00] ok, drop the refusing connections comments [16:35:06] i dont see any issues with wdq100[12] [16:35:20] those were logstash and i had run the wrong command for grepping out the massive amound of pybal log stuff [16:35:26] all i see now is Jul 19 16:34:14 lvs1003 pybal[13046]: [wdqs_80 ProxyFetch] WARN: wdqs1003.eqiad.wmnet (disabled/partially up/not pooled): Fetch failed, 0.034 s [16:35:28] but I still don't see any traffic coming to 1002 [16:35:41] and indeed, 1002 shows inactive in confctl [16:35:54] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1002.eqiad.wmnet [16:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:08] well, we did that already [16:36:11] and now it stuck? [16:36:14] robh@puppetmaster1001:~$ sudo -i confctl select name=wdqs1002.eqiad.wmnet get [16:36:14] {"wdqs1002.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=wdqs,service=wdqs"} [16:36:21] ah, cool [16:36:27] SMalyshev: i did that exact same command though, you can see in the log! ;] [16:36:30] let's see if that produces any traffic [16:36:49] oh wait, i said set pooled active, since it already had set pool yes [16:36:50] oh well [16:37:01] Jul 19 16:35:53 lvs1003 pybal[13046]: [wdqs_80] INFO: New enabled server wdqs1002.eqiad.wmnet, weight 10 [16:37:08] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3453047 (10Arlolra) >>! In T170548#3442747, @MoritzMuehlenhoff wrote: > @Arlolra : ruthenium has been updated to 6.11 Any reason that's not v6.11.1? The .1 patch is a security release. [16:37:09] robh: yup, seeing queries on 1002 now! [16:37:12] thanks! [16:37:21] sorry for my adding some confusion in the middle there iwth the bad log paste! [16:37:33] glad to help [16:37:55] at least i pasted my log outputs so it was obvious i had bad info ;] [16:41:42] (03PS1) 10RobH: decom of mw1196 [dns] - 10https://gerrit.wikimedia.org/r/366285 (https://phabricator.wikimedia.org/T170441) [16:43:29] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3453074 (10RobH) [16:43:35] (03CR) 10RobH: [C: 032] decommission of mw1196 [puppet] - 10https://gerrit.wikimedia.org/r/366284 (https://phabricator.wikimedia.org/T170441) (owner: 10RobH) [16:44:05] (03CR) 10RobH: [C: 032] decom of mw1196 [dns] - 10https://gerrit.wikimedia.org/r/366285 (https://phabricator.wikimedia.org/T170441) (owner: 10RobH) [16:45:47] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3453079 (10RobH) p:05Triage>03Normal a:05RobH>03Cmjohnson This system is ready for disk wipe and decom step remainder. [16:51:06] 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3453087 (10aaron) [16:51:11] 10Operations, 10hardware-requests, 10monitoring, 10Patch-For-Review: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3453101 (10Dzahn) 05Open>03stalled stalled for a moment please, librenms had to go back to it temp. 
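The wdqs1002 confusion above comes down to the accepted values of the `pooled` field: `set/pooled=active` left the object untouched, while `yes`, `no` and `inactive` are the values seen working in this log (with `inactive` also keeping pybal from probing the backend at all). A sketch of the check-and-fix sequence as it played out; the selector names are straight from the log, while the journalctl invocation assumes pybal logs through systemd/syslog on the LVS host, as the pasted "lvs1003 pybal[13046]" lines suggest.
```
# Inspect the current conftool state of a backend:
sudo confctl select name=wdqs1002.eqiad.wmnet get

# "set/pooled=active" had no effect above; use the values that worked in the log:
sudo confctl select name=wdqs1002.eqiad.wmnet set/pooled=yes
sudo confctl select name=wdqs1003.eqiad.wmnet set/pooled=inactive   # out for maintenance

# On the LVS host, confirm pybal picked the change up (log source is an assumption):
sudo journalctl -u pybal --since '10 minutes ago' | grep wdqs
```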
[16:54:55] 10Operations, 10hardware-requests, 10monitoring, 10Patch-For-Review: decom netmon1001 - https://phabricator.wikimedia.org/T171018#3453121 (10RobH) a:03Dzahn Please feel free to assign to me once this is ready for the non-interrupt steps, thanks! [17:02:48] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3453140 (10mobrovac) My guess would be that it got packaged around the same time of the release of v6.11.1 (it came out 2017-07-11). This update resolves one security patch - CVE-2017-1000381 - which wo... [17:03:25] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 54 [17:04:59] 10Operations, 10DBA, 10Performance-Team, 10Availability (Multiple-active-datacenters): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#3453151 (10aaron) Roll-out should probably be something like: a) DB_MASTER connections for testwiki/mediawikiwiki (group... [17:13:38] (03CR) 10Faidon Liambotis: [C: 04-1] systemd: add defines to manage systemd units (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/365900 (owner: 10Giuseppe Lavagetto) [17:15:25] PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:15] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 76316 bytes in 0.271 second response time [17:16:43] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3453184 (10dpatrick) [17:22:40] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3453246 (10MoritzMuehlenhoff) The security fixes from 6.11.1 are of course all part of our package already, see the included changelog. [17:23:45] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - logstash-json-tcp_11514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-syslog-tcp_10514 - Could not depool s [17:23:46] eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down! [17:23:46] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - logstash-syslog-tcp_10514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-json-tcp_11514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool s [17:23:46] eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1002.eqiad.wmnet because of too many down! 
[17:23:56] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - logstash-json-tcp_11514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-syslog-tcp_10514 - Could not depool s [17:23:57] eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down! [17:23:57] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - logstash-json-tcp_11514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-syslog-tcp_10514 - Could not depool s [17:23:57] eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down! [17:24:45] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [17:24:45] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [17:24:55] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [17:25:36] 10Operations, 10Continuous-Integration-Infrastructure, 10Discovery, 10Discovery-Analysis, 10Release-Engineering-Team (Watching / External): Setup a mirror for R language dependencies (CRAN) - https://phabricator.wikimedia.org/T170995#3453250 (10mpopov) @hashar Thank you for making this ticket and emailin... [17:25:55] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:28:53] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Reading-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3453258 (10ovasileva) @GWicke - about 90% sure that we'll go with electron. If ther... [17:32:50] (03CR) 10Bearloga: "Do you want to add a dependency on Ibd8b76f2ffd1cfaab6fdcc84117042eb668ed598 and then use r::cran, etc.?" [puppet] - 10https://gerrit.wikimedia.org/r/363337 (https://phabricator.wikimedia.org/T153856) (owner: 10Hashar) [17:33:57] 10Operations, 10Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3453266 (10Krinkle) (Draft / brain dump) * performance.wikimedia.org: simple frontend, low-priority, can go on a VM. * coal: high-throughput, high-priority, high-risk (hard to rep... [17:36:24] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#3453271 (10RobH) [17:36:26] 10Operations, 10ops-ulsfo, 10DC-Ops: determine model/serial info for kvm-ulsfo - https://phabricator.wikimedia.org/T170613#3453269 (10RobH) 05Open>03Resolved did this yesterday [17:37:23] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#2792080 (10RobH) a:05RobH>03None I'm pretty sure we (Faidon, Chris, Papaul, and myself) fixed these in the last week. Faidon has a more recent access list than I do though, @faidon? 
[17:39:25] 10Operations, 10scap2, 10Scap (Scap3-MediaWiki-MVP): Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629#3453277 (10demon) 05Open>03Resolved [17:40:01] (03CR) 10Ottomata: "Unless ops policy/preferences has changed, I dunno if this is gonna fly! :o" [puppet] - 10https://gerrit.wikimedia.org/r/366170 (https://phabricator.wikimedia.org/T153856) (owner: 10Bearloga) [17:42:52] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3453304 (10RobH) So it seems this system is offline, and loaded into the installer. It had been fully installed before, so I'm now investigating. [17:43:55] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - logstash-json-tcp_11514 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-syslog-tcp_10514 - Could not depool s [17:43:56] eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down! [17:44:05] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - logstash-syslog-tcp_10514 - Could not depool server logstash1001.eqiad.wmnet because of too many down!: logstash-log4j_4560 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-tcp_11514 - Could not depool server logstash1002.eqiad.wmnet because of too many down! [17:44:05] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - logstash-syslog-tcp_10514 - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-json-udp_11514_udp - Could not depool server logstash1002.eqiad.wmnet because of too many down!: logstash-syslog-udp_10514_udp - Could not depool server logstash1002.eqiad.wmnet because of too many down! [17:44:42] jouncebot: next [17:44:42] In 0 hour(s) and 15 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T1800) [17:44:44] jouncebot: now [17:44:44] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [17:44:55] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [17:45:08] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:45:08] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [17:53:34] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3453367 (10RobH) a:05RobH>03Cmjohnson So this tries to load into the installer, and fails for sdb: ``` ┌─────────────┤ [!!] Partition disks ├─... 
[17:53:44] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3453369 (10RobH) [17:54:49] the problems with logstash crashing is me, I'm reverting sending 4xx to logstash [17:54:56] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3215445 (10RobH) This could also simply be a bad or lose cable on the drive bay! Basic troubleshooting of swapping sda and sdb to see if the e... [17:57:15] jouncebot: next [17:57:15] In 0 hour(s) and 2 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T1800) [17:57:35] i have a few patches for SWAT, i'm still waiting for the to merge before doing cherry-picks [17:58:14] (03PS1) 10Filippo Giunchedi: Revert "kafkatee: send 4xx to logstash as well" [puppet] - 10https://gerrit.wikimedia.org/r/366294 [17:58:58] jouncebot refresh [17:59:00] I refreshed my knowledge about deployments. [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T1800). [18:00:40] Where's stuff to deploy when you wanna deploy stuff? :| [18:01:09] i'll be cherry-picking https://gerrit.wikimedia.org/r/#/c/366261/ and https://gerrit.wikimedia.org/r/#/c/366290/ to wmf.9 and wmf.10 [18:01:12] (03CR) 10Filippo Giunchedi: [C: 032] Revert "kafkatee: send 4xx to logstash as well" [puppet] - 10https://gerrit.wikimedia.org/r/366294 (owner: 10Filippo Giunchedi) [18:01:18] Niharika: ^ i'm just waiting for the latter to merge [18:02:32] Got it. [18:06:48] (03PS6) 10ArielGlenn: write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) [18:07:05] (03PS1) 10Chad: group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366297 [18:07:28] (03CR) 10Chad: [C: 04-2] group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366297 (owner: 10Chad) [18:11:37] Niharika: ok, sorry it took so long. updated: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T1800 [18:11:45] i take it you're deploying today? :) [18:12:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3453457 (10Eevans) a:05Eevans>03Cmjohnson [18:12:46] MatmaRex: I am! [18:12:55] No worries, you're the only client today. [18:13:22] (03PS7) 10ArielGlenn: write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) [18:15:34] Niharika: I've got a couple of things that could go out... (batch size changes to a maintenance script) [18:15:53] Reedy: Add them! [18:16:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3453505 (10mobrovac) @RobH, @Cmjohnson are we still on track for offlining `restbase-dev100[1-3]` and rsyncing the data/moving the disks to `restbase-d... 
[18:17:39] Niharika: notes on mine - ro.wp is not on wmf.10 yet, so i can't test that, only wmf.9; and both patches have to be deployed to have any effect. [18:19:04] MatmaRex: Okay, you can check the ro.wp patch on mwdebug1002 then. I pulled it for wmf9. [18:20:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3453521 (10RobH) >>! In T166181#3453505, @mobrovac wrote: > @RobH, @Cmjohnson are we still on track for offlining `restbase-dev100[1-3]` and rsyncing t... [18:20:03] I pulled it for wmf10 as well now. [18:21:22] Niharika: yupp, looks good! [18:21:33] MatmaRex: Okay then, gonna sync it. [18:23:31] !log niharika29@tin Synchronized php-1.30.0-wmf.9/includes/collation/IcuCollation.php: IcuCollation: Fix diacritic characters for Romanian (ro) headings https://gerrit.wikimedia.org/r/#/c/366295/ (duration: 00m 47s) [18:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:28] (03CR) 10ArielGlenn: [C: 032] write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [18:24:35] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [18:24:45] I have this for SWAT [18:24:46] https://gerrit.wikimedia.org/r/#/c/362876/ [18:24:49] anyone around? [18:24:58] super simple, not testable [18:24:58] !log niharika29@tin Synchronized php-1.30.0-wmf.10/includes/collation/IcuCollation.php: IcuCollation: Fix diacritic characters for Romanian (ro) headings https://gerrit.wikimedia.org/r/#/c/366296/ (duration: 00m 46s) [18:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:10] !log ariel@tin Started deploy [dumps/dumps@63705de]: write list of special dump files with no dump job content [18:25:12] MatmaRex: Both synced. [18:25:14] !log ariel@tin Finished deploy [dumps/dumps@63705de]: write list of special dump files with no dump job content (duration: 00m 02s) [18:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:25] Amir1: Yeah, add it to the calendar. [18:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:31] It's Christmas! [18:25:39] :D [18:26:04] Niharika: thanks. looks like production will need the second one too to clear the cache. apparently this was the first time anyone hit that codepath on the mwdebug1002 machine, so it cached it right [18:28:34] 10Operations: Include 5xx numbers in fluorine fatalmonitor - https://phabricator.wikimedia.org/T116627#3453569 (10mmodell) Interesting... |429|47.8% |403|31.4% |400|10.8% |405|9.8% |416|0.2% [18:29:13] Zuul's dealing with a lot of stuff right now. 
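The mwdebug1002 round above is the standard SWAT verification flow: the cherry-pick is staged on a debug-only app server and the reviewer pins their requests to it before the file is synced everywhere. A sketch of doing a quick command-line check follows; the exact X-Wikimedia-Debug header value format has varied over time, so treat it as an assumption, and the actual heading check in this case was done by eye in a browser.
```
# Spot-check that the page still renders (HTTP 200) via the debug backend
# that has the staged change...
curl -s -o /dev/null -w 'debug backend: %{http_code}\n' \
  -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
  'https://ro.wikipedia.org/wiki/Categorie:%C3%8Embr%C4%83c%C4%83minte'

# ...and compare with the same request served by the regular production pool.
curl -s -o /dev/null -w 'production: %{http_code}\n' \
  'https://ro.wikipedia.org/wiki/Categorie:%C3%8Embr%C4%83c%C4%83minte'
```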
[18:30:01] (03PS8) 10Niharika29: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [18:30:06] (03CR) 10Niharika29: [C: 032] Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [18:33:56] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3453628 (10faidon) [18:34:30] (03Merged) 10jenkins-bot: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [18:36:20] (03CR) 10jenkins-bot: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [18:36:47] Amir1: Your change is on mwdebug1002. Anything to test? [18:37:16] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:37:20] yeah, I think so but it'll be very brief [18:38:44] Niharika: it's okay [18:38:56] Amir1: Okay, syncing then. [18:40:16] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (duration: 00m 47s) [18:40:25] Amir1: Done. [18:40:26] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): replace sdb and then setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3453652 (10RobH) dell dispatch SR951188562 for the replacement hard disk, shipping to eqiad. [18:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:15] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [18:42:21] Thanks! [18:44:08] MatmaRex: https://gerrit.wikimedia.org/r/#/c/366298 is on mwdebug1002 (wmf9) [18:44:16] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:45:02] Niharika: looks good [18:46:22] MatmaRex: Ack. Syncing. [18:47:04] !log niharika29@tin Synchronized php-1.30.0-wmf.9/includes/collation/IcuCollation.php: Update FIRST_LETTER_VERSION for rowiki changes https://gerrit.wikimedia.org/r/#/c/366298/ (duration: 00m 46s) [18:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:19] MatmaRex: Done for wmf9. Waiting on Zuul for other one. [18:48:04] Niharika: looks good in production on https://ro.wikipedia.org/wiki/Categorie:Îmbrăcăminte ! thanks! [18:49:00] MatmaRex: Awesome. Syncing the wmf10 one too. [18:49:19] (i have to go for a minute, but you shouldn't need me for anything more) [18:49:30] !log niharika29@tin Synchronized php-1.30.0-wmf.10/includes/collation/IcuCollation.php: Update FIRST_LETTER_VERSION for rowiki changes https://gerrit.wikimedia.org/r/#/c/366299/ (duration: 00m 46s) [18:49:34] No worries. It's done. 
[18:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:11] (03PS1) 10ArielGlenn: setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) [18:56:10] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3453752 (10ArielGlenn) That puppet patchset above is missing a lot, has lots of dup code, and guaranteed not to pass jenkins either, but it's got a draft of m... [18:57:29] Feature request: Make Zuul interface show an estimated time to merge for core and extension patches so the deployers can +2 them well in advance. (20 minutes and counting right now) [18:58:18] (03PS1) 10Dzahn: librenms: active_server param, don't pull data from multi servers [puppet] - 10https://gerrit.wikimedia.org/r/366310 (https://phabricator.wikimedia.org/T159756) [19:00:04] RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T1900). Please do the needful. [19:00:26] RainbowSprinkles: Gimme a few more minutes. Waiting on Zuul. :( [19:01:00] Mine just merged [19:01:10] Can both just be synced straight out. Because maintenance scripts :) [19:01:15] Awesome. [19:01:19] Okay, will do. [19:02:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3453787 (10Cmjohnson) @Eevans We are definitely good for tomorrow. Do we need to do both days? [19:04:03] !log niharika29@tin Synchronized php-1.30.0-wmf.9/maintenance/updateRestrictions.php: Set batch size to 1000 in updateRestrictions https://gerrit.wikimedia.org/r/#/c/366302/ (duration: 00m 46s) [19:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:48] !log niharika29@tin Synchronized php-1.30.0-wmf.10/maintenance/updateRestrictions.php: Set batch size to 1000 in updateRestrictions https://gerrit.wikimedia.org/r/#/c/366301/ (duration: 00m 47s) [19:05:52] Reedy: Synced both. [19:05:57] !log mobrovac@tin Started deploy [restbase/deploy@c5938f4] (staging): (no justification provided) [19:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:58] thanks :D [19:05:59] RainbowSprinkles: The conch is all yours. [19:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:42] * RainbowSprinkles always has the conch [19:06:45] * RainbowSprinkles has backup conches [19:06:47] hehe [19:07:39] !log mobrovac@tin Finished deploy [restbase/deploy@c5938f4] (staging): (no justification provided) (duration: 01m 42s) [19:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3453814 (10RobH) @Cmjohnson we didn't setup the rsync in advance, so we'll have to use a USB HDD/SSD to copy over some data before the migration. We wa... 
[19:07:56] (03PS2) 10Dzahn: librenms: active_server param, don't pull data from multi servers [puppet] - 10https://gerrit.wikimedia.org/r/366310 (https://phabricator.wikimedia.org/T159756) [19:08:15] (03CR) 10jerkins-bot: [V: 04-1] setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [19:08:42] !log mobrovac@tin Started deploy [restbase/deploy@c5938f4]: Expose the translation API end points and fix SwaggerUI - T107914 T170729 [19:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:52] T107914: Consider options for longer-term content translation API end points - https://phabricator.wikimedia.org/T107914 [19:10:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3453829 (10Cmjohnson) @robh @eevans let's move the 2nd day until Monday or Tuesday. migration from a disk will be slow and to make sure that all is work... [19:11:07] (03CR) 10jerkins-bot: [V: 04-1] librenms: active_server param, don't pull data from multi servers [puppet] - 10https://gerrit.wikimedia.org/r/366310 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [19:14:11] 10Operations, 10Traffic, 10Patch-For-Review: python-varnishapi daemons seeing "Log overrun" constantly - https://phabricator.wikimedia.org/T151643#3453850 (10ema) So it looks like the varnishstatsd overruns occur mostly in ulsfo: ``` $ sudo grep -F 'Log overru' /srv/syslog/syslog.log | grep cp4 -c 20737 18:... [19:14:22] (03PS2) 10Jdlrobson: Undeploy Cards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357858 (https://phabricator.wikimedia.org/T167452) [19:14:25] (03Abandoned) 10Jdlrobson: Undeploy Cards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357858 (https://phabricator.wikimedia.org/T167452) (owner: 10Jdlrobson) [19:16:44] !log mobrovac@tin Finished deploy [restbase/deploy@c5938f4]: Expose the translation API end points and fix SwaggerUI - T107914 T170729 (duration: 08m 02s) [19:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:54] T107914: Consider options for longer-term content translation API end points - https://phabricator.wikimedia.org/T107914 [19:18:47] (03CR) 10Chad: [C: 032] group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366297 (owner: 10Chad) [19:19:14] (03PS1) 10Jdlrobson: Stop RelatedArticles A/B test and clean up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366314 (https://phabricator.wikimedia.org/T169948) [19:20:25] (03Merged) 10jenkins-bot: group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366297 (owner: 10Chad) [19:20:35] (03CR) 10jenkins-bot: group1 to wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366297 (owner: 10Chad) [19:23:47] (03PS2) 10ArielGlenn: setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) [19:25:04] (03CR) 10jerkins-bot: [V: 04-1] setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [19:27:06] !log demon@tin Started scap: group1 to wmf.10 + symlink swap [19:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:39] (03CR) 
10Andrew Bogott: Puppetmaster: Fix apache config ssldir (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/365053 (owner: 10Andrew Bogott) [19:30:05] (03PS3) 10Dzahn: librenms: active_server param, don't pull data from multi servers [puppet] - 10https://gerrit.wikimedia.org/r/366310 (https://phabricator.wikimedia.org/T159756) [19:31:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3453893 (10Eevans) To summarize an IRC discussion: If preserving the data will complicate matters such that we won't have the cluster back on-line unti... [19:33:23] (03CR) 10Andrew Bogott: "I guess it doesn't work either way" [puppet] - 10https://gerrit.wikimedia.org/r/365053 (owner: 10Andrew Bogott) [19:48:43] !log demon@tin Finished scap: group1 to wmf.10 + symlink swap (duration: 21m 37s) [19:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:36] (03PS3) 10Smalyshev: Add more units for conversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362606 (https://phabricator.wikimedia.org/T168582) [19:50:10] Shit, wikibase is fucked in wmf.10 [19:50:26] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: Abort wmf.10 [19:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:12] Heya aude, about? [19:51:21] Fatal error: Class undefined: \Wikibase\Lib\JsonUnitStorage in /srv/mediawiki/php-1.30.0-wmf.10/includes/libs/ObjectFactory.php on line 143 [19:51:26] Immediately on wmf.10 roll out [19:52:01] (disappeared when rolled back) [19:54:08] (03CR) 10Dzahn: [C: 032] librenms: active_server param, don't pull data from multi servers [puppet] - 10https://gerrit.wikimedia.org/r/366310 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [19:54:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:54:45] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:56:38] (03CR) 10Dzahn: "on netmon2001:" [puppet] - 10https://gerrit.wikimedia.org/r/366310 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [19:57:49] !log Restarting Cassandra; restbase-dev1001-a to apply additional data_file_directory (T170276) [19:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:00] T170276: Test/evaluate JBOD support - https://phabricator.wikimedia.org/T170276 [19:59:55] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T2000). 
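The wmf.10 fatal above ("Class undefined: \Wikibase\Lib\JsonUnitStorage", thrown from ObjectFactory, which instantiates whatever class name the configuration hands it) smells like a class that was renamed or moved into a new namespace between branches. One quick way to confirm that from a deploy or app server is to grep both deployed trees for the class; the extensions/Wikidata path and the /srv/mediawiki prefix are assumptions about where the Wikibase build lives.

    # Is the short class name still defined in wmf.10, or only a namespaced variant?
    grep -rn 'class JsonUnitStorage' /srv/mediawiki/php-1.30.0-wmf.9/extensions/Wikidata/  | head
    grep -rn 'class JsonUnitStorage' /srv/mediawiki/php-1.30.0-wmf.10/extensions/Wikidata/ | head

    # And which fully-qualified name the operations config points ObjectFactory at:
    grep -n 'JsonUnitStorage' /srv/mediawiki/wmf-config/Wikibase-production.php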
[20:00:48] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3454023 (10mobrovac) [20:01:15] no parsoid deploy today [20:01:45] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:01:49] Nothing for ORES today [20:02:55] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [20:03:52] !log mobrovac@tin Started deploy [restbase/deploy@3bb90c9] (staging): (no justification provided) [20:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [20:04:32] (03PS1) 10Reedy: Fix minor code style issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366321 [20:05:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:05:31] !log mobrovac@tin Finished deploy [restbase/deploy@3bb90c9] (staging): (no justification provided) (duration: 01m 39s) [20:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:46] !log mobrovac@tin Started deploy [restbase/deploy@3bb90c9]: (no justification provided) [20:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:12] 10Operations, 10hardware-requests: hardware request for netmon1001 replacement - https://phabricator.wikimedia.org/T156040#3454077 (10Dzahn) [20:07:15] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89954.15 seconds [20:07:16] 10Operations, 10monitoring, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3454076 (10Dzahn) 05Resolved>03Open [20:07:21] (03PS1) 10Reedy: $wgDefaultUserOptions['proofreadpage-showheaders'] = 1 for tawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366323 (https://phabricator.wikimedia.org/T169478) [20:08:10] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 257 bytes in 0.102 second response time [20:09:14] that paged me [20:09:25] yeah I'm not sure what's up [20:09:35] madhuvishy: is toolschecker acting up or legit? [20:09:43] (03PS1) 10Dzahn: librenms: rsync rrd data from netmon1001 to netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/366324 (https://phabricator.wikimedia.org/T159756) [20:10:02] chasemp: i'm looking [20:10:27] Caught exception: {'info': '(unknown error code)', 'desc': 'Connect error'} [20:10:57] I can't seem to get into instances atm...is that just me? madhuvishy mutante? [20:11:04] Me too! [20:11:05] PROBLEM - Getent speed check on labstore1005 is CRITICAL: CRITICAL: getent group tools.admin failed [20:11:05] let me try [20:11:36] ok...ldap must be having issues? [20:11:42] that is also an ldap check there on labstore1005 [20:11:45] yeah [20:11:53] moritzm: about? 
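The restbase/deploy entries above are scap3 service deploys rather than MediaWiki syncs: the deployer runs scap from the service's checkout on the deploy host, usually hitting the staging environment before the full target list. A sketch under those assumptions; the repo path and the -e/--environment usage are assumptions, only the log messages are taken from the log.

    # On the deploy host, from the service's deploy repository
    cd /srv/deployment/restbase/deploy

    # Deploy to the staging targets first
    scap deploy -e staging 'Expose the translation API end points and fix SwaggerUI - T107914 T170729'

    # Then to the full production target list
    scap deploy 'Expose the translation API end points and fix SwaggerUI - T107914 T170729'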
[20:12:00] yea, not getting on an instance so far [20:12:28] (03CR) 10jerkins-bot: [V: 04-1] librenms: rsync rrd data from netmon1001 to netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/366324 (https://phabricator.wikimedia.org/T159756) (owner: 10Dzahn) [20:12:46] !log serpens:~# service slapd restart [20:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:12] !log seaborgium:~# service slapd restart [20:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:31] eh, internal server error on icinga :/ [20:14:38] when i wanted to check if LDAP login works there [20:14:49] i'll look at icinga [20:15:05] !log mobrovac@tin Finished deploy [restbase/deploy@3bb90c9]: (no justification provided) (duration: 09m 19s) [20:15:18] https://toolsadmin.wikimedia.org/ 500 error too [20:15:22] 500 Internal Server Error [20:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:28] bd808: ^ is that ldap issues? [20:15:46] (03CR) 10Krinkle: [C: 031] "The core change was merged, we haven't heard anything about a php startup perf regression, but also haven't looked for it yet. Will look i" [puppet] - 10https://gerrit.wikimedia.org/r/353228 (https://phabricator.wikimedia.org/T107128) (owner: 10Tim Starling) [20:15:58] chasemp: probably. striker is pretty closely tied to ldap [20:16:01] * bd808 looks [20:16:06] PROBLEM - Getent speed check on labstore1004 is CRITICAL: CRITICAL: getent group tools.admin failed [20:16:26] 10Operations, 10Africa-Wikimedia-Developers, 10Release-Engineering-Team, 10Composer: Setup a Composer Repository (Packagist) for MediaWiki Extensions - https://phabricator.wikimedia.org/T170897#3454169 (10D3r1ck01) [20:16:30] * andrewbogott is here, I think [20:16:37] andrewbogott: just noticed it within the last 5 [20:16:48] chasemp: yeah "Can't contact LDAP server" [20:16:51] we have two canaries failed, toolschecker and also the getent ldap check [20:17:02] I restarted slapd on serpens and seaborgium [20:17:07] I bet I have it only pointed at the eqiad server too... [20:17:15] ...thought maybe the mem leak cron was not working right or something? [20:17:19] logstash is down too, fyi [20:17:48] logstash shouldn't be affected by ldap... [20:18:06] I can log on to seaborgium at least... [20:18:07] so it's up [20:18:08] it has to check my credentials, though [20:18:17] mobrovac: ah. right [20:18:19] no I think logstash is unrelated, due to logstash crashing on input [20:18:21] slapd is runnign on both [20:18:44] godog: logstash was working fine before all these other outages started being reported [20:18:49] as in, 10 mins ago it was fine [20:18:56] I can run ldap queries from tools-bastion-02 via cli [20:19:22] bd808: say logged in there I can't seem to get into instances? [20:19:32] (03CR) 10Jforrester: [C: 031] $wgDefaultUserOptions['proofreadpage-showheaders'] = 1 for tawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366323 (https://phabricator.wikimedia.org/T169478) (owner: 10Reedy) [20:19:44] bd808: I'm trying to login there now, can you see why it's bouncing me? 
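Since the first responses above are blind slapd restarts, a couple of low-risk checks on the LDAP servers themselves help separate "the daemon is down" from "the daemon is up but something else broke". A generic sketch (the slapd unit name matches the log; everything else is illustrative):

    # On serpens / seaborgium: is slapd actually running, and what did it log recently?
    service slapd status
    journalctl -u slapd -n 100 --no-pager

    # Is it still listening on both the plain (389) and TLS (636) ports?
    ss -tlnp | grep -E ':(389|636)\b'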
[20:19:46] !log restaring slapd on seaborgium [20:19:54] tools-bastion-02.tools.eqiad.wmflabs that is [20:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:00] andrewbogott: I did try that fyi a few minutes ago [20:20:13] mobrovac: logstash the process hasn't been fine earlier today too, before ldap that is [20:20:20] chasemp: when I try to sudo I get "sudo: ldap_start_tls_s(): Connect error" [20:20:24] ok oh [20:20:49] so maybe that's the problem? The queries I was doing were not ldaps [20:21:13] has anyting changed at all with ldap any time recently? [20:21:15] (03CR) 10Jforrester: [C: 031] "Maybe we should lint this repo. ;-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366321 (owner: 10Reedy) [20:21:16] I can't think of it [20:21:30] godog: any ideas? [20:21:37] I'm getting a 500 from icinga too, I'm checking einsteinium at least [20:21:55] godog: i was checking that too, i dont see it in error log of apache?? [20:21:56] chasemp: afair no recent changes in ldap [20:22:02] (03CR) 10Reedy: "jerkins is a dick though :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366321 (owner: 10Reedy) [20:22:10] godog: i didnt want to just comment out the LDAP auth part to test... [20:22:29] maybe we change to simple auth and a random pass? [20:22:39] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.004 second response time [20:23:58] mutante: easier to local a port via ssh [20:24:01] The ldap logs themselves aren't complaining at all... [20:24:20] I can hit serpens and seaborgium and serpens from labnet via telnet, and ldap is not available to all prod thigns I've tested so it seems like not an issue isolated to instances [20:24:21] (03PS1) 10Eevans: WIP: Configure Cassandra for restbase-dev[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/366334 (https://phabricator.wikimedia.org/T171104) [20:24:42] (03CR) 10Jforrester: [C: 031] "Inorite." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366321 (owner: 10Reedy) [20:24:59] logging out of gerrit and back in works fine, fwiw [20:25:03] and it's also ldap [20:25:27] our icinga instance me and paladox runs uses ldap aswell and it appears to be ok its a bit dodgy but works [20:25:29] I can hit the ports 389 and 636, but it seems that TLS handshake is failing on 636? [20:25:45] ldap is /mostly/ responding to things, simple unsecure ldap queries work fine [20:25:51] e.g. ldapsearch -x uid=andrew [20:25:58] Jul 19 20:24:46 serpens slapd[26963]: SASL [conn=29007] Failure: no secret in database [20:26:01] (03CR) 10Eevans: [C: 04-1] "We still need IPs for this" [puppet] - 10https://gerrit.wikimedia.org/r/366334 (https://phabricator.wikimedia.org/T171104) (owner: 10Eevans) [20:26:20] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 14 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3454232 (10DStrine) [20:27:11] I don't see anything in public or private puppet repo that looks suspicious [20:27:32] maybe a cert expired? [20:27:50] only thing I can think of atm [20:28:03] andrewbogott arent there alerts for certs before expiration though? [20:28:06] it happened within a few seconds on both /I think/ [20:28:11] there /should/ be [20:28:31] anyone called moritz yet? [20:28:37] andrewbogott: no, could you? 
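The split symptom above (plain `ldapsearch -x` queries answer while sudo's ldap_start_tls_s() and connections to 636 fail) can be reproduced from any affected client, which narrows things to the TLS layer rather than slapd itself. A generic sketch; the ldap-labs.eqiad.wikimedia.org host name matches the cert patches later in the log, and the base DN is an assumption.

    # Plain, unencrypted query on 389 -- this path was still working:
    ldapsearch -x -H ldap://ldap-labs.eqiad.wikimedia.org -b 'dc=wikimedia,dc=org' 'uid=andrew'

    # Same query, but require a StartTLS upgrade (-ZZ makes TLS failure fatal):
    ldapsearch -x -ZZ -H ldap://ldap-labs.eqiad.wikimedia.org -b 'dc=wikimedia,dc=org' 'uid=andrew'

    # Raw handshake on ldaps/636, printing the certificate chain the server offers:
    openssl s_client -connect ldap-labs.eqiad.wikimedia.org:636 -showcerts </dev/null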
[20:28:39] yep [20:28:56] (i'm around too) [20:29:38] Zppix thats not for operations [20:29:43] ldap is not working there [20:29:46] so i carn't login [20:29:48] hey [20:29:51] what's going on? [20:30:09] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [20:30:22] the cert that ldap is handing out is valid until Nov 28 11:30:33 2017 GMT [20:30:41] paravoid: ldap authed binds failing [20:30:50] paravoid: best thnking atm is the ldap certs expired in both ldap servers ...at the same time?? [20:30:56] paravoid: http://checker.tools.wmflabs.org/ldap was first, and Icinga also has 500 error [20:30:58] see _security [20:30:58] they're not expired [20:30:59] I left moritz a message on his cell [20:32:52] mutante gerrit creates an account from ldap [20:33:06] so it's not like icinga.wikimedia.org or any other application using ldap. [20:33:08] what does that checker.tools/ldap script run [20:33:09] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [20:33:35] paladox: ah, right [20:34:29] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [20:35:27] cteam: atm it seems like while access to instances is being restricted since we can't do ldap key checks tools seem up and instances seem to be chugging along fwiw, we are blindfolded even though the car is in motion :) [20:35:46] mutante: https://github.com/wikimedia/puppet/blob/production/modules/toollabs/files/toolschecker.py#L79 [20:35:54] grafana-admin too, error 500 where Apache uses LDAP auth [20:44:29] (03PS3) 10ArielGlenn: setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) [20:45:56] (03PS1) 10Mobrovac: Add the Scap configuration [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/366404 (https://phabricator.wikimedia.org/T137371) [20:46:39] (03CR) 10jerkins-bot: [V: 04-1] setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [20:51:41] (03PS1) 10Faidon Liambotis: base: add new WMF 2017-2020 Certificate Authority [puppet] - 10https://gerrit.wikimedia.org/r/366422 [20:53:05] (03PS1) 10Faidon Liambotis: Update ldap-labs.{eqiad,codfw}.wikimedia certs [puppet] - 10https://gerrit.wikimedia.org/r/366428 [20:53:55] (03CR) 10Faidon Liambotis: [C: 032] base: add new WMF 2017-2020 Certificate Authority [puppet] - 10https://gerrit.wikimedia.org/r/366422 (owner: 10Faidon Liambotis) [20:54:36] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Update ldap-labs.{eqiad,codfw}.wikimedia certs [puppet] - 10https://gerrit.wikimedia.org/r/366428 (owner: 10Faidon Liambotis) [21:00:25] PROBLEM - puppet last run on labcontrol1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:00:31] (03PS2) 10Aftab: $wgDefaultUserOptions['proofreadpage-showheaders'] = 1 for tawikisource & bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366323 (https://phabricator.wikimedia.org/T169478) (owner: 10Reedy) [21:00:55] (03PS4) 10ArielGlenn: setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) [21:03:09] (03PS1) 10Mobrovac: Cassandra: Switch metrics-collector to use Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) [21:04:26] (03PS2) 10Mobrovac: Cassandra: Switch metrics-collector to use Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) [21:05:00] (03PS1) 10Faidon Liambotis: Update ldap-corp.{eqiad,codfw}.wikimedia certs [puppet] - 10https://gerrit.wikimedia.org/r/366462 [21:05:12] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Update ldap-corp.{eqiad,codfw}.wikimedia certs [puppet] - 10https://gerrit.wikimedia.org/r/366462 (owner: 10Faidon Liambotis) [21:05:28] (03CR) 10jerkins-bot: [V: 04-1] Cassandra: Switch metrics-collector to use Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [21:05:31] !log krypton - run puppet, restart apache, fixed grafana-admin [21:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:54] (03PS3) 10Mobrovac: Cassandra: Switch metrics-collector to use Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) [21:08:25] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 234 bytes in 0.007 second response time [21:14:29] (03CR) 10Eevans: [C: 031] "Insofar as I can remember how to shave this particular yak, it LGTM." 
(031 comment) [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/366404 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [21:16:26] (03PS13) 10Reedy: Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:17:55] (03CR) 10jerkins-bot: [V: 04-1] Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:18:25] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.030 second response time [21:19:55] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89904.48 seconds [21:21:18] (03CR) 10Eevans: [C: 031] "LGTM (AFAICT)" [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [21:21:34] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: most of group1 back on wmf.10 [21:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:51] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/compiler02/7107/" [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [21:23:55] !log Ran puppet and restarted apache on thorium (Runs hue, yarn, and pivot) [21:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:50] (03CR) 10Eevans: [C: 031] "> PCC - https://puppet-compiler.wmflabs.org/compiler02/7107/" [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [21:24:56] (03PS1) 10Mobrovac: Add the Scap3 configuration [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/366466 (https://phabricator.wikimedia.org/T116340) [21:25:50] !log Ran puppet and restarted apache on logstash100[1..3] [21:25:51] (03CR) 10Eevans: [C: 031] "> PCC - https://puppet-compiler.wmflabs.org/compiler02/7107/" [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [21:25:58] !log logstash1001/1002 - restarted apache for CA change (logstash/kibana back) [21:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:24] mutante: logstash is back \o/ [21:26:36] :) yep [21:27:14] Heh, did at same time [21:27:40] oh heh [21:28:16] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [21:28:59] (03PS1) 10Chad: wikidatawiki back to wmf.9 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366467 (https://phabricator.wikimedia.org/T171107) [21:29:33] !log tungsten - restarted apache for CA change (xhgui) [21:29:35] RECOVERY - puppet last run on labcontrol1004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:30] did db1016 went down? 
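Stepping back to the CA change being rolled out above: each of these "run puppet, restart apache" fixes is just getting the client to pick up the new WMF CA, and the same pair of openssl checks confirms whether a given host now trusts the chain the LDAP servers present. Using /etc/ssl/certs as the trust-store path is an assumption (the usual Debian location).

    # What certificate (and validity window) is the LDAP server actually presenting?
    echo | openssl s_client -connect ldap-labs.eqiad.wikimedia.org:636 2>/dev/null \
        | openssl x509 -noout -subject -issuer -dates

    # Does this host's installed CA bundle validate that chain?
    # "Verify return code: 0 (ok)" means the new CA is in place locally.
    echo | openssl s_client -connect ldap-labs.eqiad.wikimedia.org:636 \
        -CApath /etc/ssl/certs 2>/dev/null | grep 'Verify return code'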
[21:30:46] it was complaining before about disk [21:31:13] !log running puppet & restarting gerrit/apache on cobalt/gerrit2001 [21:31:19] seems up to me [21:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:27] is m1 down? [21:31:30] RainbowSprinkles: i was going to ask abut gerrit service restart, thank you [21:31:31] or is it ldamp [21:31:48] RainbowSprinkles: is it true that it would affect new users just not existing ones [21:31:56] New logins [21:31:59] Existing ones were fine [21:32:09] i could still login after logging out, fwiw [21:32:14] Oh, I couldn't. [21:32:16] Hmm [21:32:29] Anyway, all better now! [21:32:37] ok :) yes [21:33:00] jynus: ldap has been futzy. having to run puppet & restart services (like apache) that talk to ldap [21:33:22] no, I am talking about dbproxy1005 error [21:33:35] RECOVERY - Getent speed check on labstore1005 is OK: OK: getent group returns within a second [21:33:35] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [21:33:55] graphite1001 will self-resolve soon, bad timing [21:34:05] !log labstore1004/1005 puppet agent --test && service nslcd restart [21:34:05] graphite web ui already works but now this puppet issues [21:34:12] jynus: I misunderstood you when you said "or is it ldamp" [21:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:21] Thought you typo'd :) [21:34:23] yes, it could be related [21:34:31] but I am wrong [21:34:35] RECOVERY - Getent speed check on labstore1004 is OK: OK: getent group returns within a second [21:34:37] it is not m1 [21:34:41] it is m5 what has issues [21:34:48] mutante: puppet must've pulled *right* as I restarted gerrit [21:34:55] That's a really annoying cascading failure [21:35:12] bd808, chasemp m5 went down accoring to the proxy [21:35:24] that means most of the valuable dbs for cloud services [21:35:25] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.327 second response time [21:35:35] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [21:35:37] !log graphite1001 - restarted apache, ran puppet [21:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:05] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.013 second response time [21:36:11] probably due to connection overload [21:36:17] (03CR) 10jerkins-bot: [V: 04-1] Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:36:26] I am going to pool db1009 back unless you tell me otherwise [21:37:17] jynus: I'm not understanding, m5 is like keystone db and openstack things? [21:37:19] chasemp, madhuvishy, mdholloway [21:37:22] yes [21:37:41] basically we went read only [21:37:51] hello? [21:37:54] probably as a result of ldap issues? 
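On the dbproxy1005 side of this, haproxy itself records which backend its health check marked down and when; querying the admin socket is usually quicker than guessing from graphs, and a reload (as done a few lines below) re-adds a recovered server. This is only a sketch, not the actual dbproxy configuration, and the socket path is an assumption.

    # Which frontends/backends does haproxy consider up or down right now?
    # (field 18 of the CSV "show stat" output is the status column)
    echo 'show stat' | socat stdio /run/haproxy/admin.sock | cut -d, -f1,2,18

    # Once the backend looks healthy again, a reload repools it, as done below:
    service haproxy reload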
[21:38:01] mdholloway: I think mis-ping sorry [21:38:10] oh, no worries :) [21:38:11] mdholloway, sorry [21:38:21] (03PS2) 10Mobrovac: Add the Scap configuration [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/366404 (https://phabricator.wikimedia.org/T137371) [21:38:25] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.373 second response time [21:38:29] jynus: I mean timing makes it seem so but I'm not sure why, I guess keystone just kept trying [21:38:38] what's the next thing to do? [21:38:46] repool, investigate if it fails again [21:39:15] k [21:39:18] thanks jynus [21:39:36] !log reloading haproxy on dbproxy1005 for repooling db1009 [21:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:51] (03PS15) 10Reedy: Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:40:12] chasemp, there is 400+ processes running on the db [21:40:19] probably not normal [21:40:25] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0 [21:40:44] andrewbogott: seems m5 (backing db for openstack things) go thammered and pushed out of the pool(?) [21:40:58] it is stable though [21:41:00] andrewbogott: jynus repooled but it seems way too busy, ideas? [21:41:10] so no immediate issue [21:41:22] I would check graphs or why it could get depooled [21:41:39] chasemp: I don't know what would be hitting it. All those puppet runs will add a /little/ bit of traffic but I wouldn't think anything difficult [21:42:26] high level of aborted connects, but I cannot give a reason for that [21:42:31] (03CR) 10Mobrovac: "> That means r366404 needs to be updated accordingly, no?" [puppet] - 10https://gerrit.wikimedia.org/r/366459 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [21:42:45] Shall I stop keystone for a moment and we'll see what happens? [21:42:51] James_F: MERGE IT [21:42:53] maybe components connecting and then failing when ldap connection stalls out [21:43:04] jynus: did that make things quieter? [21:43:22] (03CR) 10Reedy: [C: 031] "It passes, let's get it merged, then start untangling the exclusions to re-enable them" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:43:31] jouncebot: now [21:43:33] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [21:43:33] jouncebot: next [21:43:33] In 1 hour(s) and 16 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T2300) [21:43:34] labstore would be trying too [21:43:39] andrewbogott, strange [21:43:57] I would expect a spike in connections or latency for the proxy to decide the depool [21:44:03] but I do not see such a thing [21:44:06] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1009&from=now-3h&to=now [21:44:43] !log netmon1001 (librenms) - re-enable puppet once to get new CA, restart Apache, disable puppet again [21:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:30] jynus: huh so mystery as to why teh proxy depooled the backend? [21:46:46] Reedy: Happy to merge if you sync to stop icinga going mad… But can't you? 
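For the "400+ processes" observation just above, the usual follow-ups on the database itself are to see who those connections belong to and whether the aborted-connect counter is climbing, which would fit clients (keystone, labstore, and friends) retrying against a stalled LDAP/TLS layer. A generic sketch, run as a privileged user on db1009:

    # Who holds those connections, and what are they doing?
    mysql -e 'SHOW FULL PROCESSLIST'

    # Connection churn counters -- a climbing Aborted_connects fits clients
    # timing out mid-handshake and retrying:
    mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"
    mysql -e "SHOW GLOBAL STATUS LIKE 'Aborted_connects'"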
[21:46:48] I do not see a reason based on metrics [21:46:49] I will check the logs [21:47:11] (03CR) 10Jforrester: [C: 031] Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:47:27] to stop icinga? [21:47:38] !log netmon1001 - adding manual ferm rule for 80/443 - fixed librenms.wm.org [21:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:46] James_F: What does icinga have to do with some coding style fixes? [21:49:19] RainbowSprinkles: Merging code in mediawiki-config without syncing it… [21:49:27] Why wouldn't we? [21:49:30] Oh [21:49:33] (or couldn't?) [21:49:34] (03PS1) 10Mobrovac: Cassandra: Switch logback-encoder to Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366473 (https://phabricator.wikimedia.org/T116340) [21:49:50] So basically, if we merge it, we need to deploy it :P [21:49:52] (03CR) 10Chad: [C: 032] wikidatawiki back to wmf.9 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366467 (https://phabricator.wikimedia.org/T171107) (owner: 10Chad) [21:49:55] WFM [21:50:03] I'll do it [21:50:22] * RainbowSprinkles puts his train conductor hat back on [21:50:25] andrewbogott, it happens at the same time that puppet runs updating the certificate, but I do not see why that would affect haproxy [21:50:25] choo choo all aboard [21:50:30] danke [21:50:51] !log tegmen - restarted apache [21:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:24] (03CR) 10jerkins-bot: [V: 04-1] Cassandra: Switch logback-encoder to Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366473 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [21:51:32] (03CR) 10Eevans: "A couple of inline comments, otherwise, LGTM" (032 comments) [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/366466 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [21:51:45] (03Merged) 10jenkins-bot: wikidatawiki back to wmf.9 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366467 (https://phabricator.wikimedia.org/T171107) (owner: 10Chad) [21:51:55] (03CR) 10jenkins-bot: wikidatawiki back to wmf.9 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366467 (https://phabricator.wikimedia.org/T171107) (owner: 10Chad) [21:52:04] (03CR) 10Chad: [C: 032] Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:52:14] (03PS2) 10Mobrovac: Cassandra: Switch logback-encoder to Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366473 (https://phabricator.wikimedia.org/T116340) [21:52:53] jynus: I have no idea :( [21:52:54] !log netmon1003 - puppet run, restarted apache - fixed servermon.wikimedia.org [21:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:43] (03Merged) 10jenkins-bot: Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:53:54] (03CR) 10jenkins-bot: Add composer test for coding standards, and try to pass [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271936 (https://phabricator.wikimedia.org/T162835) (owner: 10Jforrester) [21:54:40] (03PS3) 10Mobrovac: Add the Scap3 configuration [software/logstash-logback-encoder] - 
10https://gerrit.wikimedia.org/r/366466 (https://phabricator.wikimedia.org/T116340) [21:54:46] (03CR) 10jerkins-bot: [V: 04-1] Cassandra: Switch logback-encoder to Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366473 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [21:54:59] !log demon@tin Started scap: all kinds of code style stuff for James_F & Reedy [21:55:10] <3 [21:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:00] (03CR) 10Mobrovac: "@eevans, {{done}} both. For restbase-dev, I just dropped 100[1-3] in favour of 100[4-6] as it is unlikely we will start using this before " [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/366466 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [21:56:14] !log graphite200* - restarted apache [21:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:47] RainbowSprinkles: Yay. [21:57:04] James_F: Know how to get phpcbf to fix the easy stuff? [21:57:23] (03PS3) 10Mobrovac: Add the Scap configuration [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/366404 (https://phabricator.wikimedia.org/T137371) [21:57:27] Reedy: Yup. Want me to? [21:57:35] Might aswell do the easy stuff [21:57:40] (I've got a meeting in 150 seconds.) [21:57:45] Plenty of time! [21:58:14] Uh-huh. [21:59:07] jynus: that's timings a mystery to me atm [21:59:12] (03CR) 10Eevans: [C: 031] Add the Scap configuration [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/366404 (https://phabricator.wikimedia.org/T137371) (owner: 10Mobrovac) [21:59:19] !log dbmonitor2001 - restarted apache [21:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:45] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [21:59:49] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/compiler02/7108/" [puppet] - 10https://gerrit.wikimedia.org/r/366473 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [21:59:55] !log bromine _transparency.wm.org - restarted apache [22:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:08] (03CR) 10Eevans: [C: 031] "> @eevans, {{done}} both. For restbase-dev, I just dropped 100[1-3]" [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/366466 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [22:00:22] !log demon@tin Finished scap: all kinds of code style stuff for James_F & Reedy (duration: 05m 23s) [22:00:29] Nice [22:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:03] (03CR) 10Eevans: [C: 031] Cassandra: Switch logback-encoder to Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/366473 (https://phabricator.wikimedia.org/T116340) (owner: 10Mobrovac) [22:01:10] Reedy: `composer install` isn't clean on master. [22:01:19] Reedy: Should we fix? [22:01:23] moment [22:01:27] (ISTR I've asked this before.) [22:01:36] I thought I'd fixed it once [22:01:49] twig? [22:01:51] It probably drifted. [22:01:55] !log hafnium, labmon1001 - restarted apache [22:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:10] Even with --no-dev twig's there, yah. [22:02:12] Meh. 
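The "composer install isn't clean on master" issue above is exactly the kind of drift (twig here) that a quick local check makes visible: mediawiki-config commits its vendor/ tree, so if composer.lock and vendor/ disagree, a fresh install leaves the working copy dirty. A minimal sketch from a mediawiki-config checkout:

    # Reinstall exactly what composer.lock pins, without dev dependencies
    composer install --no-dev

    # Anything listed here means vendor/ (or the lock file) has drifted from master
    git status --porcelain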
[22:02:18] I guess it's probably due to dependancies that aren't hard [22:02:21] modifiers [22:02:45] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [22:02:54] (03PS1) 10Reedy: Updating twig/twig (v1.34.3 => v1.34.4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366476 [22:04:52] (03PS1) 10Andrew Bogott: labs_vmbuilder: get domain name from eth0 instead of * [puppet] - 10https://gerrit.wikimedia.org/r/366477 (https://phabricator.wikimedia.org/T170828) [22:06:31] (03CR) 10Reedy: [C: 032] Updating twig/twig (v1.34.3 => v1.34.4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366476 (owner: 10Reedy) [22:08:19] (03Merged) 10jenkins-bot: Updating twig/twig (v1.34.3 => v1.34.4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366476 (owner: 10Reedy) [22:08:28] (03CR) 10jenkins-bot: Updating twig/twig (v1.34.3 => v1.34.4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366476 (owner: 10Reedy) [22:13:03] (03Abandoned) 10Andrew Bogott: labs_vmbuilder: get domain name from eth0 instead of * [puppet] - 10https://gerrit.wikimedia.org/r/366477 (https://phabricator.wikimedia.org/T170828) (owner: 10Andrew Bogott) [22:13:15] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.010 second response time [22:13:55] (03PS1) 10Andrew Bogott: nova fullstack test: use new 8.8 image [puppet] - 10https://gerrit.wikimedia.org/r/366478 [22:14:52] !log reedy@tin Synchronized multiversion/: (no justification provided) (duration: 01m 11s) [22:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:37] (03CR) 10Andrew Bogott: [C: 032] nova fullstack test: use new 8.8 image [puppet] - 10https://gerrit.wikimedia.org/r/366478 (owner: 10Andrew Bogott) [22:18:45] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [22:36:29] (03PS2) 10Dzahn: librenms: rsync rrd data from netmon1001 to netmon1002 [puppet] - 10https://gerrit.wikimedia.org/r/366324 (https://phabricator.wikimedia.org/T159756) [22:38:11] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.379 second response time [22:45:50] (03PS1) 10Smalyshev: Use correct class name for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366480 (https://phabricator.wikimedia.org/T171107) [22:51:26] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.008 second response time [22:52:45] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [22:53:32] 10Operations, 10monitoring: librenms: consider using Distributed Poller with multiple netmon servers - https://phabricator.wikimedia.org/T171122#3454761 (10Dzahn) [22:54:45] RECOVERY - Host cp3048 is UP: PING WARNING - Packet loss = 54%, RTA = 83.79 ms [22:55:20] 10Operations, 10monitoring: librenms: consider using Distributed Poller with multiple netmon servers - https://phabricator.wikimedia.org/T171122#3454779 (10Dzahn) p:05Triage>03Low [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170719T2300). [23:00:04] Jdlrobson and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:50] \o here [23:01:02] \o [23:02:00] Hello [23:02:07] I can SWAT this evening. [23:03:18] Reedy: Aha, thanks. [23:07:01] jdlrobson: on https://gerrit.wikimedia.org/r/#/c/366314/1/wmf-config/InitialiseSettings.php why get rid of one, and keep a default value for the other (in extension.json, RelatedArticlesEnabledBucketSize has also 1 as default value)? [23:07:26] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.007 second response time [23:07:44] jdlrobson: understood, because you removed RelatedArticlesEnabledSamplingRate from the code [23:07:54] (03CR) 10Aude: "suggestion..." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366480 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:08:01] Commit message is a good place for such context. [23:09:00] jdlrobson: add a reference to f4c82d3a33 please (the commit renaming SamplingRate to BucketSize) in the commit message [23:09:08] (03PS1) 10Reedy: phpcbf on mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366483 [23:09:13] aude: not sure how would we have two configs? [23:10:32] (03PS2) 10Jdlrobson: Stop RelatedArticles A/B test and clean up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366314 (https://phabricator.wikimedia.org/T169948) [23:11:17] (03PS2) 10Smalyshev: Use correct class name for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366480 (https://phabricator.wikimedia.org/T171107) [23:12:03] (03PS3) 10Smalyshev: Use correct class name for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366480 (https://phabricator.wikimedia.org/T171107) [23:14:07] ebernhardson: changes will soon be merged, tests are at 86% for the last task [23:14:26] SMalyshev: you intend to add this one to SWAT? [23:14:41] Dereckson: yes, i think [23:15:06] Dereckson: I think yes [23:15:31] ok [23:16:20] (and it's ready?) [23:16:27] (03CR) 10Aude: [C: 031] Use correct class name for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366480 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:16:42] (logics sounds good to me to keep best compatibility) [23:16:45] Dereckson: I think so, if no objections from aude & daniel [23:17:56] Fine, please add it to the deployments table. [23:18:02] Dereckson: did so :) [23:18:06] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366480 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:19:21] thanks [23:19:58] Dereckson: thanks. I wonder how it can be validated? On mwdebug? [23:20:55] (03CR) 10Jforrester: [C: 04-1] "This should include the change to phpcs.xml so that further issues don't creep in. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366483 (owner: 10Reedy) [23:20:56] we can see if it complains about a non existant class for example [23:21:11] (03Merged) 10jenkins-bot: Use correct class name for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366480 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:21:24] Dereckson: ok, yes, let's see [23:21:33] live on mwdebug1002 [23:21:47] (03CR) 10Reedy: "I dunno what changes need making :P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366483 (owner: 10Reedy) [23:22:02] You can also track https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [23:22:12] (03CR) 10jenkins-bot: Use correct class name for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366480 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:22:16] (03Restored) 10Aude: Update my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/363180 (owner: 10Aude) [23:23:41] Dereckson: I've gone through a couple wikidata entities which should have triggered it, nothing so far [23:23:58] so it looks ok I think [23:24:23] did you check test.wikidata? [23:24:30] good to check both [23:24:48] since they are on different versions of wikibase / mediawiki [23:25:44] (03PS1) 10Reedy: Bump mediawiki/mediawiki-codesniffer to 0.10.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366485 [23:26:30] seems to be ok [23:26:46] (03PS4) 10Aude: Update my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/363180 [23:26:50] ok [23:26:57] (03CR) 10jerkins-bot: [V: 04-1] Bump mediawiki/mediawiki-codesniffer to 0.10.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366485 (owner: 10Reedy) [23:27:00] ack'ed [23:27:04] Dereckson: I think it's good (at least not breaking anything so far :) [23:27:31] ebernhardson: your revert change is live on mwdebug1002 [23:27:54] Dereckson: checking [23:28:34] (03Abandoned) 10Reedy: Bump mediawiki/mediawiki-codesniffer to 0.10.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366485 (owner: 10Reedy) [23:28:41] Dereckson: wmf.10 or wmf.9? [23:28:45] both [23:28:46] ACKNOWLEDGEMENT - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.006 second response time Madhuvishy The test cronjob keeps failing on the grid. Investigating. [23:29:02] !log dereckson@tin Synchronized wmf-config/Wikibase-production.php: Use correct class name for JsonUnitStorage (T171107) (duration: 00m 48s) [23:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:14] T171107: Class undefined: \Wikibase\Lib\JsonUnitStorage - https://phabricator.wikimedia.org/T171107 [23:30:42] Dereckson: wmf.10 isn't on mwdebug1002. was wondering why it didn't work :P [23:30:50] Dereckson: the fix i mean, i just logged in and checked the code [23:31:12] I redo a scap pull [23:31:34] 23:31:14 Finished rsync common (duration: 00m 04s) [23:32:31] Dereckson: works as expected now for wmf.9 and wmf.10 [23:32:55] in what order to sync those files? [23:33:50] Dereckson: hmm, probably js then php? might cause a few errors in the meantime not sure. It will take a few minutes before users get the new js [23:35:54] is someone from ops around who can review/merge https://gerrit.wikimedia.org/r/#/c/363180/ (updating my ssh key) [23:37:33] Dereckson: i'm going to need to go soon... 
[23:37:36] is there a problem with my swat? [23:37:58] aude: oh right, I need to do that.. [23:37:59] a bunch of people from the wikidata team verified / gave +1 [23:38:22] new laptop is arriving in a few days but for now i'm on my wmde macbook [23:38:28] aude: puppet swat? [23:38:37] when? [23:38:47] tues and thurs, 9-10 PST ? [23:38:58] its on deploy calender if i'm wrong :) [23:39:43] the time isn't best for me, but maybe i can do that [23:40:08] aude: maybe if you put a really nice comment asking nicely they'll do it even without you present [23:40:14] ok :) [23:40:14] :) [23:40:30] thanks [23:40:35] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.004 second response time [23:44:49] (03PS1) 10Smalyshev: Cleanup old BC config for JsonUnitStorage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366488 (https://phabricator.wikimedia.org/T171107) [23:45:10] (03CR) 10Smalyshev: [C: 04-1] "-1 for now until time comes to deploy it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366488 (https://phabricator.wikimedia.org/T171107) (owner: 10Smalyshev) [23:45:32] 10Operations, 10Ops-Access-Requests: Requesting access to tools.speedydeletionwikia for Dylann1024 (Nathan Larson) - https://phabricator.wikimedia.org/T171130#3454957 (10Mdupont) [23:47:18] (03CR) 10Reedy: phpcbf on mediawiki-config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366483 (owner: 10Reedy) [23:47:42] Dereckson: syncing my revert? [23:50:05] (03PS2) 10Reedy: phpcbf on mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366483 [23:50:13] * jdlrobson is lost [23:51:43] o/ [23:51:50] Hey gang! [23:51:55] Did labs auth change recently? [23:52:09] ? [23:52:27] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.020 second response time [23:52:35] My SSH keys seem to not work anymore and when I tried to hit the labs UI everything had changed. :-) [23:53:15] (03CR) 10Jdlrobson: [C: 031] Stop RelatedArticles A/B test and clean up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366314 (https://phabricator.wikimedia.org/T169948) (owner: 10Jdlrobson) [23:53:25] Which one? :P [23:53:28] (03PS1) 10Reedy: Disable all phpcs rules for lols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366489 [23:53:32] Also, toolsadmin is something I've yet to see. :-) [23:54:02] I probably need an LDAP password reset. [23:54:18] I *think* I remember it, but it's been a while. [23:54:41] At least I still have my 2FA. If that's still used even. :-) [23:54:55] (03CR) 10jerkins-bot: [V: 04-1] Disable all phpcs rules for lols [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366489 (owner: 10Reedy) [23:58:30] (03CR) 10Bartosz Dziewoński: "Oh the hilarity! A megabyte of PHPCS output!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366489 (owner: 10Reedy) [23:59:37] (03PS3) 10Reedy: phpcbf on mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366483 [23:59:59] Dereckson still logged into tin, idle 27 minutes though ...
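For the coding-standards thread that closes this log (the "phpcbf on mediawiki-config" patch and the megabyte of PHPCS output), the usual loop with the mediawiki-codesniffer ruleset looks roughly like this from a repo checkout, assuming a phpcs.xml at the repo root wires in the MediaWiki standard as the patches above are adding:

    # Report violations (with sniff codes, -s, so rules can be excluded in phpcs.xml)
    vendor/bin/phpcs -p -s

    # Let phpcbf auto-fix the mechanical ones (whitespace, array syntax, etc.)
    vendor/bin/phpcbf -p

    # Re-run the check; whatever remains needs a manual fix or an explicit exclusion
    vendor/bin/phpcs -p -s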