[00:04:25] (03PS1) 10Giuseppe Lavagetto: role::deployment::server: only run l10update on the active server [puppet] - 10https://gerrit.wikimedia.org/r/351556 [00:05:21] <_joe_> thcipriani: ^^ this is a decent stopgap methinks [00:06:37] (03CR) 10Thcipriani: [C: 031] role::deployment::server: only run l10update on the active server [puppet] - 10https://gerrit.wikimedia.org/r/351556 (owner: 10Giuseppe Lavagetto) [00:06:58] _joe_: lgtm [00:07:25] <_joe_> thcipriani: https://puppet-compiler.wmflabs.org/6275 [00:07:34] im wondering is it possible to add puppet code like <% %> in a normal file without the file ending in erb? [00:07:43] <_joe_> ok it should work [00:08:47] (03PS2) 10Giuseppe Lavagetto: role::deployment::server: only run l10update on the active server [puppet] - 10https://gerrit.wikimedia.org/r/351556 [00:08:55] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::deployment::server: only run l10update on the active server [puppet] - 10https://gerrit.wikimedia.org/r/351556 (owner: 10Giuseppe Lavagetto) [00:09:56] (03Abandoned) 10Thcipriani: l10nupdate: don't run during deployment freeze [puppet] - 10https://gerrit.wikimedia.org/r/351365 (owner: 10Thcipriani) [00:14:00] RECOVERY - Disk space on ocg1003 is OK: DISK OK [01:04:13] (03PS1) 10Dzahn: gerrit: move hiera lookup to profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/351564 [01:08:51] (03PS2) 10Dzahn: gerrit: move hiera lookup to profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/351564 [01:13:57] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3230368 (10jayvdb) As a bit of a temporary solution, I've added a hold rule on wikimedia-l for `X-Spam-Score:[^+]*[+]{4,}`. cc other list admins @Austin, @Ijon , @Esh77 . We may... [01:16:34] (03PS1) 10Dzahn: gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) [01:32:22] (03PS1) 10Dzahn: gerrit: use new ecdsa key for replication, add pub key [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) [01:33:21] (03PS2) 10Dzahn: gerrit: use new ecdsa key for replication, add pub key [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) [01:36:17] 06Operations, 13Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3230374 (10Dzahn) At this point if anyone wants to start putting more roles on it, (one role at a time), please feel free to pick one and go ahead. [01:38:11] ACKNOWLEDGEMENT - HP RAID on ms-be1039 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. daniel_zahn https://phabricator.wikimedia.org/T163690 [01:38:12] ACKNOWLEDGEMENT - puppet last run on ms-be1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdc] daniel_zahn https://phabricator.wikimedia.org/T163690 [01:39:00] ACKNOWLEDGEMENT - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[parted-/dev/sdl] daniel_zahn https://phabricator.wikimedia.org/T83811 [01:39:28] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2256 is CRITICAL: Host mw2256 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T163346 [01:41:20] !log kubernetes - puppet fails because "E: Unable to locate package cni [01:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:58] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Build calico - https://phabricator.wikimedia.org/T150434#2785879 (10Dzahn) on kubernetes1001-1004 we currently have Icinga alerts due to puppet failures due to "**E: Unable to locate package cni**". on apt.wikimedia.org we ha... [01:51:43] ACKNOWLEDGEMENT - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] daniel_zahn https://phabricator.wikimedia.org/T150434#3230384 [01:51:43] ACKNOWLEDGEMENT - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] daniel_zahn https://phabricator.wikimedia.org/T150434#3230384 [01:51:43] ACKNOWLEDGEMENT - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] daniel_zahn https://phabricator.wikimedia.org/T150434#3230384 [01:51:43] ACKNOWLEDGEMENT - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 16 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] daniel_zahn https://phabricator.wikimedia.org/T150434#3230384 [01:52:00] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Build calico - https://phabricator.wikimedia.org/T150434#3230386 (10Dzahn) 05Resolved>03Open [01:54:16] ACKNOWLEDGEMENT - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 144 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 daniel_zahn https://phabricator.wikimedia.org/T162132 [01:54:16] ACKNOWLEDGEMENT - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 144 connecting: cp3003_v4, cp3003_v6, cp3009_v4, cp3009_v6 daniel_zahn https://phabricator.wikimedia.org/T162132 [01:54:16] ACKNOWLEDGEMENT - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 144 connecting: cp3009_v4, cp3009_v6 not-conn: cp3003_v4, cp3003_v6 daniel_zahn https://phabricator.wikimedia.org/T162132 [02:43:12] !log l10nupdate@naos scap sync-l10n completed (1.29.0-wmf.21) (duration: 14m 02s) [02:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:33] !log l10nupdate@naos ResourceLoader cache refresh completed at Wed May 3 02:48:33 UTC 2017 (duration 5m 21s) [02:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:19:43] (03PS1) 10BryanDavis: keyholder: fix provisioning on trusty [puppet] - 10https://gerrit.wikimedia.org/r/351571 [03:22:36] (03CR) 10BryanDavis: "I have a Trusty deploy server in a labs project and I had to create a dummy /etc/tmpfiles.d directory in order to provision keyholder ther" [puppet] - 10https://gerrit.wikimedia.org/r/351571 (owner: 10BryanDavis) [03:28:30] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:35:17] (03PS1) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 [03:38:33] PROBLEM - MariaDB disk space on db1015 is CRITICAL: DISK CRITICAL - free space: /srv 97933 MB (5% inode=99%) [03:42:30] (03PS2) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 [03:44:00] (03PS3) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 [03:49:36] (03PS4) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 [03:54:20] PROBLEM - Disk space on db1015 is CRITICAL: DISK CRITICAL - free space: /srv 63972 MB (3% inode=99%) [03:56:30] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [03:58:09] (03PS5) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 [04:16:04] (03PS6) 10Dzahn: icinga: move nsca config to pub repo [puppet] - 10https://gerrit.wikimedia.org/r/351572 [04:24:03] (03PS1) 10Dzahn: dumps: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/351574 [04:25:14] (03PS2) 10Dzahn: dumps: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/351574 [04:31:15] (03PS1) 10Dzahn: zotero: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/351575 [04:32:40] PROBLEM - MariaDB Slave SQL: s3 on db1015 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:50] PROBLEM - MariaDB Slave IO: s3 on db1015 is CRITICAL: CRITICAL slave_io_state could not connect [04:34:03] PROBLEM - mysqld processes on db1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [04:36:38] just got these pages [04:37:04] yea, me too [04:37:13] it's out of disk [04:37:17] right [04:37:27] that seems to be the first alert [04:37:42] but so far i didnt categorize it as outage or so [04:38:25] since we had some db pages like this before and they could be handled later, afaict [04:39:00] if it was / i'd try some quick fixes to give it space, but this is /srv, just the mysql data itself [04:39:03] (03PS3) 10Phuedx: pagePreviews: Fix NavPopups gadget detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351166 (https://phabricator.wikimedia.org/T164044) (owner: 10Jdlrobson) [04:39:40] and dont wanna touch that without calling dba [04:39:45] yeah [04:39:47] https://tendril.wikimedia.org/host/view/db1015.eqiad.wmnet/3306 [04:40:09] spike in query traffic few minutes ago [04:40:43] shwiki is 80g [04:40:58] https://tendril.wikimedia.org/report/slow_queries?host=db1015&hours=1 [04:41:02] may be the culprit [04:41:06] what's on it, in the old dbtree there was this "hover" effect, told you which wikis, but we lost that at some point [04:41:09] ah [04:42:37] i'm unsure if this is important enough to call [04:43:51] any signs of user impact? [04:44:26] looking at #-tech [04:44:44] don't see anything [04:44:45] looking at #wikipedia [04:44:47] neither [04:45:41] well, given that we have some db pages in the "for awareness" section every once in a while, i think we can let them sleep and soon they will wake up anyways? 
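A note for readers following the db1015 triage above: the DISK alerts on /srv fire on the percentage of free space (5% and then 3% in the messages above). Below is a minimal sketch of that threshold logic in Python, assuming nothing about the real check_disk plugin beyond its free-percentage semantics; the thresholds and path are illustrative, not the production values.

```
import shutil

WARN_PCT, CRIT_PCT = 6, 5  # illustrative free-space thresholds

def check_disk(path="/srv"):
    usage = shutil.disk_usage(path)
    free_pct = usage.free * 100 / usage.total
    free_mb = usage.free // (1024 * 1024)
    if free_pct <= CRIT_PCT:
        return "CRITICAL - free space: %s %d MB (%d%%)" % (path, free_mb, free_pct)
    if free_pct <= WARN_PCT:
        return "WARNING - free space: %s %d MB (%d%%)" % (path, free_mb, free_pct)
    return "OK"

print(check_disk("/srv"))
```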
[04:46:51] it's not a master [04:46:53] mutante, madhuvishy: it was depooled recently for being low on space -- https://gerrit.wikimedia.org/r/#/c/351273/ [04:46:57] just a slave of db1075 [04:47:05] bd808: aah [04:47:12] thanks for letting us know [04:47:40] ok, i won't worry as long as it's not a master [04:48:02] not sure why it isn't downtimed then [04:48:17] I found that by grepping mediawiki-config and blaming the db-eqiad.php file [04:48:24] ACKNOWLEDGEMENT - Disk space on db1015 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=0%): daniel_zahn https://gerrit.wikimedia.org/r/#/c/351273/ [04:48:24] ACKNOWLEDGEMENT - MariaDB Slave IO: s3 on db1015 is CRITICAL: CRITICAL slave_io_state could not connect daniel_zahn https://gerrit.wikimedia.org/r/#/c/351273/ [04:48:24] ACKNOWLEDGEMENT - MariaDB Slave SQL: s3 on db1015 is CRITICAL: CRITICAL slave_sql_state could not connect daniel_zahn https://gerrit.wikimedia.org/r/#/c/351273/ [04:48:26] ACKNOWLEDGEMENT - MariaDB disk space on db1015 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=0%): daniel_zahn https://gerrit.wikimedia.org/r/#/c/351273/ [04:48:28] ACKNOWLEDGEMENT - mysqld processes on db1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld daniel_zahn https://gerrit.wikimedia.org/r/#/c/351273/ [04:48:33] nice [04:48:42] madhuvishy: most likely because we have that bug that icinga forgets them ... [04:48:45] right [04:48:56] it forgets mine too :/ [04:48:57] but that is also kind of a new problem [04:49:07] since we used tegmen or so. .it feels [04:49:14] not like Icinga has always been doing this [04:49:21] but we dont know yet why [04:49:26] right, okay [04:50:17] https://phabricator.wikimedia.org/T164206 [05:00:55] (03CR) 10Dzahn: [C: 031] keyholder: fix provisioning on trusty [puppet] - 10https://gerrit.wikimedia.org/r/351571 (owner: 10BryanDavis) [05:03:50] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:03:52] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:03:52] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:03:53] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:03:53] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:03:53] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:03:53] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:03:53] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
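On the "icinga forgets downtimes" thread above (T164206): downtimes are normally scheduled by writing an external command to Icinga's command pipe, and they live only in the daemon's retention/state files, which is why a restart or state loss can silently drop them. A minimal sketch of scheduling a host downtime that way, assuming the conventional Icinga command-file path (the actual path on the WMF monitoring host may differ):

```
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed path; site-specific

def schedule_host_downtime(host, hours, author, comment):
    now = int(time.time())
    end = now + hours * 3600
    # Format: SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
    cmd = "[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s\n" % (
        now, host, now, end, hours * 3600, author, comment)
    with open(CMD_FILE, "w") as f:  # the command file is a FIFO read by icinga
        f.write(cmd)

schedule_host_downtime("db1015", 24, "example_admin", "depooled, low on space")
```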
[05:04:40] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:04:40] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:04:40] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:04:40] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:04:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:04:41] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:04:41] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:04:42] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:05:00] uhm.. this may look bad here, but it's not paging [05:05:40] out again for now [05:49:21] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3230476 (10jayvdb) Ugh, the problem is not false positives, but that rule is increasing the amount of spam seen by list moderators. Presumably, somewhere, previously spam was gett... [05:56:50] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2565.30 Read Requests/Sec=462.80 Write Requests/Sec=6.90 KBytes Read/Sec=38609.60 KBytes_Written/Sec=184.00 [06:03:50] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=15.70 Read Requests/Sec=169.60 Write Requests/Sec=6.50 KBytes Read/Sec=1498.00 KBytes_Written/Sec=317.20 [06:12:00] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3230481 (10revi) From my experience combating spam on wikinews-l, I believe those with spam score +4 or above are usually safe to discard. (List is otherwise low traffic, so I can'... [06:23:40] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:24:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [06:32:28] 06Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3230503 (10Marostegui) Looks like db1015 ran out of space in the middle of the night: check emails from icinga and: https://grafana.wikimedia.org/dashboard/file/server-board.js... [06:39:20] 06Operations: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3230527 (10Marostegui) >>! In T163280#3218884, @Cmjohnson wrote: > The ssd has been received and swapped. @Marostegui or someone else please fix the raid cfg and resolve. Thanks > > I would prefer if someone more famili... [06:42:52] 06Operations, 10ops-eqiad, 10DBA: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3230537 (10Marostegui) @Cmjohnson this looks similar as the same issue with lots of servers, including dbstore1001 on this task: T158893#3186029 if we get Dell to fix this or advise on that ticket, probably we ca... 
[06:44:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:45:40] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:51:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:51:40] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [06:52:08] (03PS2) 10Marostegui: wmnet: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/350824 (https://phabricator.wikimedia.org/T155099) [06:52:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:55:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [06:58:47] (03CR) 10Muehlenhoff: [C: 031] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/351575 (owner: 10Dzahn) [07:00:36] (03CR) 10Muehlenhoff: [C: 031] dumps: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/351574 (owner: 10Dzahn) [07:08:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:08:40] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:13:33] (03PS1) 10Marostegui: db-eqiad.php: Pool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351577 [07:14:36] (03CR) 10Marostegui: "@jcrespo, I was reviewing the db-eqiad.php before the switchover and I thought why not pooling db1097 (new host with data from db1040) jus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351577 (owner: 10Marostegui) [07:16:45] (03CR) 10ArielGlenn: [C: 032] dumps: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/351574 (owner: 10Dzahn) [07:20:53] (03PS1) 10Muehlenhoff: Fix entry for zareen [puppet] - 10https://gerrit.wikimedia.org/r/351578 [07:21:22] (03PS2) 10Muehlenhoff: Fix entry for zareen [puppet] - 10https://gerrit.wikimedia.org/r/351578 [07:23:22] (03CR) 10Muehlenhoff: [C: 032] Fix entry for zareen [puppet] - 10https://gerrit.wikimedia.org/r/351578 (owner: 10Muehlenhoff) [07:26:03] (03CR) 10DCausse: "only nitpicks, feel free to ignore" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [07:31:15] 06Operations, 10ops-codfw, 10DBA: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3230560 (10Marostegui) 05Open>03Resolved Let's close this for now. [07:32:32] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3230562 (10Marostegui) db1063 is definitely in good shape to keep being the master for the switchover: ``` root@db1063:~# megacli -AdpBbuCmd -a0 | grep Tem... [07:33:14] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3230564 (10Nemo_bis) Spam filtering based on X-Spam-Score is best discussed on T161082 IMHO (we need a global solution). 
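A note on the mailman hold rule quoted earlier, `X-Spam-Score:[^+]*[+]{4,}`: SpamAssassin-style headers commonly render the score as a run of `+` characters, one per point, so matching four or more pluses approximates "score >= 4", the level the wikinews-l comment above suggests is usually safe to act on. A quick illustration of how the pattern behaves (header values are hypothetical):

```
import re

HOLD_RE = re.compile(r"X-Spam-Score:[^+]*[+]{4,}")

for header in ("X-Spam-Score: score=2 (++)",
               "X-Spam-Score: score=4 (++++)",
               "X-Spam-Score: score=7 (+++++++)"):
    print(header, "->", "hold" if HOLD_RE.search(header) else "pass")
```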
[07:35:19] !log updated python-jenkins on labnodepool1001 to 0.4.14 (needed by latest Jenkins LTS) [07:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:58] !log Restarting Nodepool to catch up with python-jenkins 0.4.14 [07:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:15] moritzm: I am watching logs [07:40:27] moritzm: looks good to me now. Thank you a lot [07:40:43] going to upgrade Jenkins and verify [07:42:03] (03PS1) 10Elukey: Remove old memcached settings related to decommed hw [puppet] - 10https://gerrit.wikimedia.org/r/351579 [07:42:32] <_joe_> !log rebuilding RAIDs on restbase1018 T163280 [07:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:40] T163280: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280 [07:43:00] RECOVERY - MD RAID on restbase1018 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 [07:43:37] (03CR) 10Elukey: [C: 032] Remove old memcached settings related to decommed hw [puppet] - 10https://gerrit.wikimedia.org/r/351579 (owner: 10Elukey) [07:45:47] (03PS1) 10Muehlenhoff: Pick a new memcached canary for debdedeploy [puppet] - 10https://gerrit.wikimedia.org/r/351581 [07:45:56] (03PS2) 10Muehlenhoff: Pick a new memcached canary for debdedeploy [puppet] - 10https://gerrit.wikimedia.org/r/351581 [07:47:59] 06Operations, 10ops-eqiad: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3230583 (10elukey) [07:48:09] (03PS3) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [07:48:12] after almost a year \o/ [07:48:25] (decommission of old mc hosts) [07:48:53] (03CR) 10Gehel: logstash - delete all indices older than 31 days (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [07:52:14] (03PS2) 10Elukey: Remove mgmt dns records for mw2090->mw2096 [dns] - 10https://gerrit.wikimedia.org/r/350813 (https://phabricator.wikimedia.org/T161488) [07:53:26] !log Upgrading Jenkins 2.46.1 -> 2.46.2 - T144106 [07:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:35] T144106: Upgrade Jenkins from 1.x to latest 2.x - https://phabricator.wikimedia.org/T144106 [07:58:08] 06Operations, 06Performance-Team, 15User-Elukey: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#3230597 (10elukey) 05stalled>03Open Finally we were able to decommission the old mc1001->mc1018 hardware and replace it with mc1019-mc1036. Some time has passed... [07:59:01] ACKNOWLEDGEMENT - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T164342 [07:59:04] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T164342#3230599 (10ops-monitoring-bot) [08:01:04] <_joe_> sigh [08:01:23] !log Rolling back Jenkins 2.46.2 -> 2.46.1 - T144106 [08:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:31] T144106: Upgrade Jenkins from 1.x to latest 2.x - https://phabricator.wikimedia.org/T144106 [08:01:40] <_joe_> volans: ^^ this is a defect of the auto handler [08:03:02] (03CR) 10Muehlenhoff: [C: 032] Pick a new memcached canary for debdedeploy [puppet] - 10https://gerrit.wikimedia.org/r/351581 (owner: 10Muehlenhoff) [08:10:35] (03CR) 10Jcrespo: "honestly, I am not sure. 
I do not think we will have a peak this time (servers are warmed up, and we did have it last time). I would leave" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351577 (owner: 10Marostegui) [08:11:12] (03CR) 10John Vandenberg: [C: 031] Make mailman reject messages with high X-Spam-Score by default [puppet] - 10https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) (owner: 10Nemo bis) [08:11:48] (03Abandoned) 10Marostegui: db-eqiad.php: Pool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351577 (owner: 10Marostegui) [08:20:37] (03CR) 10Jcrespo: "Don't abandon it, precisely I commented to leave it on standby until we see if we merge it or not." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351577 (owner: 10Marostegui) [08:21:02] (03Restored) 10Marostegui: db-eqiad.php: Pool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351577 (owner: 10Marostegui) [08:24:39] <_joe_> !log deactivating restbase1018-vg for RAID failover and rebuild T163280 [08:24:40] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:40] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:40] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:40] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:40] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:41] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:41] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:42] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:42] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:48] T163280: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280 [08:24:50] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
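The dbstore1001 flaps above are the NRPE replication checks timing out under load (explained a moment later in the log as "probably the backups"); the underlying check is essentially SHOW SLAVE STATUS. A minimal sketch of that logic, assuming credentials in a local defaults file — the real check script has more states, including the intentional "not a slave" / "no error: intentional" cases visible in the recovery messages:

```
import pymysql

def check_slave_io(defaults_file="/root/.my.cnf"):
    conn = pymysql.connect(read_default_file=defaults_file,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()
    if row is None:
        return "OK slave_io_state not a slave"
    state = row["Slave_IO_Running"]
    level = "OK" if state == "Yes" else "CRITICAL"
    return "%s slave_io_state Slave_IO_Running: %s" % (level, state)

print(check_slave_io())
```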
[08:24:56] probably the backups [08:25:30] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [08:25:30] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:25:30] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:25:30] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:25:30] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [08:25:31] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:25:31] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:25:32] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:25:32] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:25:40] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:34:32] 06Operations, 10DBA, 06DC-Ops: db1063 thermal issues (was: db1063 io (s5 master eqiad) performance is bad) - https://phabricator.wikimedia.org/T164107#3230684 (10jcrespo) Actually, temperatures have apparently shifted side: ``` $ cat /sys/class/thermal/thermal_zone*/temp 67000 53000 ``` [08:36:08] 06Operations, 06Labs, 10Tool-Labs, 13Patch-For-Review: Upgrade bootstrap-vz version for tools docker builder - https://phabricator.wikimedia.org/T157526#3230688 (10MoritzMuehlenhoff) 05Open>03Resolved This has been fixed by moving to a backport of bootstrap-vz from stretch (replacing the old stdeb pack... [08:36:42] 06Operations, 06Release-Engineering-Team, 05Goal, 13Patch-For-Review, and 3 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3230690 (10MoritzMuehlenhoff) [08:37:39] (03PS1) 10Elukey: Set adapter=MYSQLI to Piwik's config.ini [puppet] - 10https://gerrit.wikimedia.org/r/351584 (https://phabricator.wikimedia.org/T164073) [08:38:51] (03CR) 10Elukey: [C: 032] Set adapter=MYSQLI to Piwik's config.ini [puppet] - 10https://gerrit.wikimedia.org/r/351584 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [08:46:01] (03CR) 10Filippo Giunchedi: "LGTM, some minor comments." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [08:47:21] (03PS1) 10Muehlenhoff: Clean up netboot config [puppet] - 10https://gerrit.wikimedia.org/r/351587 [08:48:38] (03PS4) 10Gehel: logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 [08:53:08] <_joe_> !log rebooting restbase1018 T163280 [08:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:16] T163280: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280 [08:54:40] PROBLEM - Host restbase1018 is DOWN: PING CRITICAL - Packet loss = 100% [08:55:00] RECOVERY - Host restbase1018 is UP: PING OK - Packet loss = 0%, RTA = 37.57 ms [08:55:20] RECOVERY - Check systemd state on restbase1018 is OK: OK - running: The system is fully operational [08:55:26] (03PS1) 10Giuseppe Lavagetto: Revert to using scap code deploys for MediaWiki [switchdc] - 10https://gerrit.wikimedia.org/r/351590 [08:55:33] (03PS2) 10Giuseppe Lavagetto: Revert to using scap code deploys for MediaWiki [switchdc] - 10https://gerrit.wikimedia.org/r/351590 [08:55:59] (03CR) 10Gehel: "I just had a look at curator (I did not know about it yet). It looks like a lot of code to just wrap the _cat API, not sure I like it. It " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [08:56:26] godog: thanks for the pointer to curator! I did not know about it yet... [08:58:32] gehel: np! I know about it only because I've been using it to do the same purging, but you are right it is a lot of code [08:58:49] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3192581 (10Joe) I just recreated the RAID arrays and rebooted the system with the new disk in place. @Eevans I'd let you... [08:59:09] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T164202#3230791 (10Joe) 05Open>03Resolved [08:59:48] godog: I'm wondering if using curator will not just move the complexity we have in python code to complexity in a yaml config file... [08:59:58] I'll play with it a bit and see [09:01:06] (03CR) 10Filippo Giunchedi: [C: 031] logstash - delete all indices older than 31 days [puppet] - 10https://gerrit.wikimedia.org/r/351327 (owner: 10Gehel) [09:01:30] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T164342#3230800 (10Joe) [09:01:32] 06Operations: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3230802 (10Joe) [09:01:35] gehel: ok thanks! not urgent, the current solution will work just fine IMO [09:02:01] godog: thanks for the review! 
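For the curator discussion just above: the puppet change under review deletes logstash indices older than 31 days, which is also curator's bread-and-butter use case. A minimal sketch of the same policy done directly against the _cat API over plain HTTP, assuming the usual logstash-YYYY.MM.DD daily index naming (endpoint and index pattern are assumptions, not the production config):

```
from datetime import datetime, timedelta
import requests

ES = "http://localhost:9200"  # assumed endpoint
CUTOFF = datetime.utcnow() - timedelta(days=31)

for index in requests.get(ES + "/_cat/indices/logstash-*?h=index").text.split():
    try:
        day = datetime.strptime(index, "logstash-%Y.%m.%d")
    except ValueError:
        continue  # skip anything not matching the daily naming scheme
    if day < CUTOFF:
        print("deleting", index)
        requests.delete("%s/%s" % (ES, index))
```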
[09:02:48] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T164287#3230805 (10Joe) [09:02:50] 06Operations: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3192172 (10Joe) [09:05:21] !log rebuild mismounted FSes on ms-be1035 - T163673 [09:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:29] T163673: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673 [09:08:02] (03PS1) 10Giuseppe Lavagetto: Switch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 [09:18:03] !log rebooting restbase1018 for update to Linux 4.9 [09:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:47] !log reboot mc[1019-1036].eqiad.wmnet for kernel upgrades [09:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:43] 06Operations, 06Operations-Software-Development, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3224532 (10Joe) >>! In T164177#3224891, @EddieGP wrote: >>>! In T164177#3224771, @Volans wrote: >> The manual change + commit + deploy of the MW configuration mig... [09:38:20] RECOVERY - Check systemd state on labsdb1006 is OK: OK - running: The system is fully operational [09:39:49] 06Operations, 10ops-eqiad: ms-be1017 repeated usb connect/disconnect message - https://phabricator.wikimedia.org/T148016#3230905 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi @Cmjohnson ack, thanks! In the meantime I don't see this message anymore on ms-be1017. Tentatively resolving for now, to be reop... [09:41:05] (03PS1) 10Alexandros Kosiaris: interface: Add a new define for handling /e/n/i config [puppet] - 10https://gerrit.wikimedia.org/r/351603 [09:44:05] 06Operations, 10ops-eqiad, 15User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3230916 (10fgiunchedi) @Cmjohnson yeah it looks like a hw raid controller cache of some sorts? We can coordinate some downtime maybe next week after the switchover to debug (or reseat?)... 
[09:48:21] (03CR) 10Alexandros Kosiaris: [C: 032] zotero: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/351575 (owner: 10Dzahn) [09:48:25] (03PS2) 10Alexandros Kosiaris: zotero: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/351575 (owner: 10Dzahn) [09:48:28] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] zotero: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/351575 (owner: 10Dzahn) [09:50:17] !log Restart db1097 to change its binlog to STATEMENT - T155099 [09:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:32] T155099: Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover) - https://phabricator.wikimedia.org/T155099 [09:50:59] (03PS1) 10Hashar: WMF: force Jenkins queries to use POST [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/351606 (https://phabricator.wikimedia.org/T144106) [09:52:57] (03CR) 10Volans: [C: 031] Revert to using scap code deploys for MediaWiki [switchdc] - 10https://gerrit.wikimedia.org/r/351590 (owner: 10Giuseppe Lavagetto) [09:54:37] (03CR) 10Hashar: [C: 032] WMF: force Jenkins queries to use POST [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/351606 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [09:55:46] (03CR) 10Volans: [C: 04-2] "Waiting the actual switchover time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 (owner: 10Giuseppe Lavagetto) [09:58:15] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert to using scap code deploys for MediaWiki [switchdc] - 10https://gerrit.wikimedia.org/r/351590 (owner: 10Giuseppe Lavagetto) [10:00:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] icinga: move nsca config to pub repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351572 (owner: 10Dzahn) [10:03:14] (03PS1) 10Hashar: 0.1.1-wmf6: force jenkins queries to use POST [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/351608 (https://phabricator.wikimedia.org/T144106) [10:05:45] !log installing icu security updates on trusty (jessie already fixed) [10:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:01] <_joe_> !log testing reverted steps of switchdc, non-dry-run --dc-from eqiad --dc-to codfw (should be noop) [10:13:07] (03CR) 10Jcrespo: [C: 031] Switch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 (owner: 10Giuseppe Lavagetto) [10:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:57] !log START - Set MediaWiki in read-only mode in eqiad (db-eqiad config already merged and git pulled) - t02_start_mediawiki_readonly (switchdc/oblivian@neodymium) [10:13:57] !log END (PASS) - Set MediaWiki in read-only mode in eqiad (db-eqiad config already merged and git pulled) - t02_start_mediawiki_readonly (switchdc/oblivian@neodymium) [10:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:33] !log START - Set MediaWiki in read-write mode in codfw (db-codfw config already merged and git pulled) - t08_stop_mediawiki_readonly (switchdc/oblivian@neodymium) [10:14:34] !log END (PASS) - Set MediaWiki in read-write mode in codfw (db-codfw config already merged and git pulled) - t08_stop_mediawiki_readonly (switchdc/oblivian@neodymium) [10:14:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:06] 06Operations, 06Operations-Software-Development, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3231000 (10EddieGP) >>! In T164177#3230847, @Joe wrote: >>>! In T164177#3224891, @EddieGP wrote: >>>>! In T164177#3224771, @Volans wrote: >>> The manual change +... [10:17:57] !log START - Switch MediaWiki master datacenter and read-write discovery records from eqiad to codfw - t05_switch_datacenter (switchdc/oblivian@neodymium) [10:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:04] !log END (PASS) - Switch MediaWiki master datacenter and read-write discovery records from eqiad to codfw - t05_switch_datacenter (switchdc/oblivian@neodymium) [10:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:34] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3231002 (10elukey) >>! In T125735#3199656, @aaron wrote: > I think it's worth trying persistent connections... [10:29:21] 06Operations, 06Operations-Software-Development, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3224532 (10Trizek-WMF) Some suggestions from IRC: * "This wiki is in read-only mode for a server switch test. See https://meta.wikimedia.org/wiki/codfw for more... [10:29:44] 06Operations, 06Operations-Software-Development, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3224532 (10Elitre) > MediaWiki is in read-only mode for a datacenter switchover test; please try again in a few minutes. See https://meta.wikimedia.org/wiki/Speci... [10:34:38] (03CR) 10Muehlenhoff: "Looks good, that matches the changes done in python-jenkins 0.4.14" [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/351606 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [10:38:12] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/350824 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [10:40:39] (03PS2) 10Giuseppe Lavagetto: Switch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 [10:40:41] (03PS1) 10Giuseppe Lavagetto: Change the read-only message to be more informative during the switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) [10:44:06] (03CR) 10EddieGP: "Wont [link] render as 1 ? Or is this some special format (neither the default wiki syntax nor html)?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [10:51:27] (03CR) 10EddieGP: "So, to clarify this more, in normal wikitext syntax" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [10:52:08] (03PS2) 10Giuseppe Lavagetto: Change the read-only message to be more informative during the switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) [10:52:10] (03PS3) 10Giuseppe Lavagetto: Switch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 [10:55:16] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3224532 (10ArielGlenn) So: I tested on a local install with MW 1.29wmf18. I tested setting $wgReadOnly or readOnlyBySection with a link wit... [10:56:16] (03CR) 10Giuseppe Lavagetto: "> So, to clarify this more, in normal wikitext syntax" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [10:56:50] (03CR) 10Muehlenhoff: [C: 032] 0.1.1-wmf6: force jenkins queries to use POST [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/351608 (https://phabricator.wikimedia.org/T144106) (owner: 10Hashar) [10:57:33] (03CR) 10EddieGP: [C: 031] Change the read-only message to be more informative during the switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [10:58:31] !log upgrading nodepool on labnodepool1001 to a package including https://gerrit.wikimedia.org/r/351608 [10:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:59] (03CR) 10ArielGlenn: "Plain link is fine, gives the user just a bit more info before they decide to click through." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:00:08] (03CR) 10ArielGlenn: [C: 031] Change the read-only message to be more informative during the switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:00:16] (03CR) 10EddieGP: [C: 031] "> Yeah you're not the only one who gave feedback :P we're going with the plain link which you can see in PS2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:02:35] !log Restarting Nodepool [11:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:28] 06Operations, 10Monitoring, 06Performance-Team: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3231121 (10Peter) Looks like [IE is ok again](https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?orgId=1&var-wiki=enwiki&var... 
[11:11:45] (03PS1) 10Giuseppe Lavagetto: Change the expected message in the db configurations [switchdc] - 10https://gerrit.wikimedia.org/r/351616 (https://phabricator.wikimedia.org/T164177) [11:11:53] <_joe_> volans: ^^ [11:11:58] * volans looking [11:12:15] <_joe_> volans: check it, if you think it's ok we can merge the mediawiki-config patch and deploy it using switchdc [11:12:31] so to test it [11:12:46] <_joe_> switchdc --dc-from eqiad --dc-to codfw [11:13:31] <_joe_> or, I do a plain scap sync-file and then we test the expressions match [11:13:50] the former works for me [11:14:00] <_joe_> ok [11:14:19] <_joe_> so once you're ok with that patch, can you merge it while I merge mediawiki-config? [11:14:33] sure, checking is the same of mw-config [11:14:51] (03CR) 10EddieGP: [C: 031] Change the expected message in the db configurations [switchdc] - 10https://gerrit.wikimedia.org/r/351616 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:15:38] where is jenkins? [11:15:43] (03CR) 10jerkins-bot: [V: 04-1] Change the expected message in the db configurations [switchdc] - 10https://gerrit.wikimedia.org/r/351616 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:15:48] (03PS1) 10Hashar: 0.1.1-wmf7 build: python-pbr < 0.10 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/351617 [11:15:50] <_joe_> there it is [11:16:00] <_joe_> what did I do wrong? [11:16:06] too long :D [11:16:16] <_joe_> wtf. let's ignore that [11:16:18] I'll add a # noqa [11:16:23] !log restarting replication on s*, and x1 eqiad -> codfw [11:16:25] <_joe_> yes please do [11:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:08] (03CR) 10Hashar: "The package build but lintian is not happy https://integration.wikimedia.org/ci/job/debian-glue-non-voting/861/testReport/junit/lintian/no" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/351617 (owner: 10Hashar) [11:18:54] (03PS2) 10Volans: Change the expected message in the db configurations [switchdc] - 10https://gerrit.wikimedia.org/r/351616 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:19:16] (03PS2) 10Hashar: 0.1.1-wmf7 build: python-pbr < 0.10 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/351617 [11:19:27] <_joe_> volans: thanks [11:19:44] wait a sec [11:20:02] (03PS3) 10Volans: Change the expected message in the db configurations [switchdc] - 10https://gerrit.wikimedia.org/r/351616 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:20:07] cleaner :D [11:20:20] <_joe_> w/e [11:20:26] (03CR) 10Muehlenhoff: [V: 032 C: 032] 0.1.1-wmf7 build: python-pbr < 0.10 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/351617 (owner: 10Hashar) [11:20:27] <_joe_> is the patch ok? [11:20:36] yes, merging [11:20:52] (03CR) 10Volans: [C: 032] Change the expected message in the db configurations [switchdc] - 10https://gerrit.wikimedia.org/r/351616 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:20:52] <_joe_> ok, merging mwconfig after that [11:21:05] I'll pull it on sarin/neodymium [11:21:16] close any switchdc you might still have open, JIC [11:21:46] really? 
[11:21:48] :-) [11:22:04] :-P was to be sure we had the new code once testing it [11:22:07] (busy with replication, will troll later) [11:22:10] ahahah [11:22:49] _joe_: done [11:23:02] !log Upgrading Jenkins 2.46.1 -> 2.46.2 - T144106 [11:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:10] T144106: Upgrade Jenkins from 1.x to latest 2.x - https://phabricator.wikimedia.org/T144106 [11:23:13] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:23:31] (03CR) 10Giuseppe Lavagetto: [C: 032] Change the read-only message to be more informative during the switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:23:31] ^ backups most likely [11:23:45] (03CR) 10jenkins-bot: Change the read-only message to be more informative during the switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351614 (https://phabricator.wikimedia.org/T164177) (owner: 10Giuseppe Lavagetto) [11:24:53] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:28] moritzm: Node id: 642912 added to jenkins \O/ [11:25:35] !log uploaded nodepool 0.1.1+wmf7 to apt.wikimedia.org [11:25:41] nodepool is now able to speak to jenkins :-} [11:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:53] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [11:25:57] hashar: \o/ [11:26:09] <_joe_> volans: testing on mwdebug1001 [11:26:15] ok [11:27:43] <_joe_> uhm interesting result [11:27:49] didn't change? [11:27:56] <_joe_> on enwiki I do not see the text of the error at all :P [11:28:08] <_joe_> they have a specialized notice about db being read-only [11:28:13] is in siteinfo though [11:28:22] "This wiki is in read-only mode for a datacenter switchover test. See https://meta.wikimedia.org/wiki/codfw for more information." [11:28:40] <_joe_> volans: ok, go to mwdebug1001 try to edit an enwiki page and see for yourself [11:28:53] ok, is that the translated one from last week? [11:28:55] <_joe_> (I used my talk page for this test) [11:28:59] sorry 2 weeks ago [11:29:05] <_joe_> nope [11:29:10] <_joe_> that's a CN [11:30:40] I've looked at testwiki and dewiki, message shown there correctly, lgtm [11:30:49] <_joe_> on enwiki the message is shown correctly too [11:30:53] <_joe_> lgtm too [11:31:27] <_joe_> sorry, s/enwiki/itwiki/ [11:31:51] <_joe_> volans: should we try switchdc for scap sync-filing then? [11:32:05] I think so, I've double checked the code, does exactly what we want [11:32:10] it will just spam a bit SAL [11:32:18] with confusing messages about readonly [11:32:26] 06Operations, 10Monitoring, 06Performance-Team: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer - https://phabricator.wikimedia.org/T163408#3231172 (10Gilles) 05Open>03Resolved a:03Gilles As expected, this happened when the traffic was switched back: https://wikitech.wiki... 
[11:32:42] <_joe_> yeah, but we do need to test this IMHO [11:32:56] <_joe_> we can do a dry run after I'm done though [11:32:59] <_joe_> it's ok too [11:33:15] <_joe_> yeah let me scap sync-file the two files [11:33:25] ok, as you wish [11:33:28] both works for me [11:33:31] <_joe_> sorry, changed my mind 3 times, I know [11:33:38] :D [11:33:55] hey you guys [11:34:38] yes :) [11:34:57] nothing, just hi :) [11:35:07] so I guess no etcd for MW after all right? :) [11:35:17] unfortunately not :( [11:37:36] !log oblivian@naos Synchronized wmf-config: Changing the read-only reason for the DC switchover (T164177) (duration: 01m 20s) [11:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:46] T164177: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177 [11:39:29] !log rebooting kubernetes1001 for update to Linux 4.9 [11:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:21] (03CR) 10Volans: [C: 04-2] "LGTM. Waiting for the switchover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 (owner: 10Giuseppe Lavagetto) [11:43:31] (03PS2) 10Volans: cache::text: switch all mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351313 (https://phabricator.wikimedia.org/T160178) [11:43:36] (03CR) 10Marostegui: [C: 031] Switch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 (owner: 10Giuseppe Lavagetto) [11:44:10] (03PS2) 10Volans: discovery::app_routes: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351315 (https://phabricator.wikimedia.org/T160178) [11:44:48] (03CR) 10Giuseppe Lavagetto: [C: 031] discovery::app_routes: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351315 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:46:11] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::text: switch all mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351313 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [11:49:43] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3231220 (10Volans) 05Open>03Resolved a:03Volans [11:54:27] 06Operations, 10Monitoring, 10netops: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264#3231238 (10ayounsi) a:03faidon Re-assigning to @faidon as he has a check ready. [11:57:04] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3231240 (10EddieGP) @Volans [Not relevant for todays switchover, feel free to read later if busy] I don't agree that this is resolved. This... [11:59:01] !log rebooting kubernetes1003 for update to Linux 4.9 [11:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:21] !log rebooted kubernetes1002, not 1003 [11:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:11] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 05codfw-rollout: switchdc: Improve wgReadOnly message - https://phabricator.wikimedia.org/T164177#3231248 (10Volans) @EddieGP I agree with you, I closed it because this one was targeting this specific rollout and switchdc and didn't want... 
[12:14:34] (03PS1) 10Muehlenhoff: aptrepo: Re-enable jenkins for jessie [puppet] - 10https://gerrit.wikimedia.org/r/351626 (https://phabricator.wikimedia.org/T157429) [12:19:13] (03CR) 10Hashar: [C: 031] aptrepo: Re-enable jenkins for jessie [puppet] - 10https://gerrit.wikimedia.org/r/351626 (https://phabricator.wikimedia.org/T157429) (owner: 10Muehlenhoff) [12:20:48] (03CR) 10Muehlenhoff: [C: 032] aptrepo: Re-enable jenkins for jessie [puppet] - 10https://gerrit.wikimedia.org/r/351626 (https://phabricator.wikimedia.org/T157429) (owner: 10Muehlenhoff) [12:20:54] (03PS2) 10Muehlenhoff: aptrepo: Re-enable jenkins for jessie [puppet] - 10https://gerrit.wikimedia.org/r/351626 (https://phabricator.wikimedia.org/T157429) [12:22:34] (03CR) 10Muehlenhoff: [V: 032 C: 032] aptrepo: Re-enable jenkins for jessie [puppet] - 10https://gerrit.wikimedia.org/r/351626 (https://phabricator.wikimedia.org/T157429) (owner: 10Muehlenhoff) [12:26:48] (03PS1) 10Gehel: elasticsearch - update reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/351629 [12:31:13] PROBLEM - Check Varnish expiry mailbox lag on cp2011 is CRITICAL: CRITICAL: expiry mailbox lag is 602607 [12:46:26] (03PS2) 10Alexandros Kosiaris: librenms: Introduce scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/350861 (https://phabricator.wikimedia.org/T129136) [12:50:59] (03PS1) 10Muehlenhoff: Add comment for jenkins and reprepro [puppet] - 10https://gerrit.wikimedia.org/r/351634 [12:51:09] (03CR) 10DCausse: [V: 032 C: 032] Elastic 5.1.2 plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/341826 (owner: 10DCausse) [12:52:17] (03PS2) 10Muehlenhoff: Add comment for jenkins and reprepro [puppet] - 10https://gerrit.wikimedia.org/r/351634 [12:53:33] PROBLEM - Check Varnish expiry mailbox lag on cp2005 is CRITICAL: CRITICAL: expiry mailbox lag is 589069 [12:56:13] PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:56:53] RECOVERY - Host kubernetes1003 is UP: PING OK - Packet loss = 0%, RTA = 37.51 ms [12:57:45] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Jenkins: Upload Jenkins LTS v2.46.2 to jessie-wikimedia/third-party - https://phabricator.wikimedia.org/T157429#3231345 (10MoritzMuehlenhoff) 05Open>03Resolved Unfortunately the Jenkins repository section still canno... 
[12:57:58] (03CR) 10Muehlenhoff: [C: 032] Add comment for jenkins and reprepro [puppet] - 10https://gerrit.wikimedia.org/r/351634 (owner: 10Muehlenhoff) [12:59:54] (03PS3) 10Muehlenhoff: jenkins: remove groovy init that disabled CLI [puppet] - 10https://gerrit.wikimedia.org/r/351261 (owner: 10Hashar) [13:00:25] (03CR) 10Hashar: [C: 031] jenkins: remove groovy init that disabled CLI [puppet] - 10https://gerrit.wikimedia.org/r/351261 (owner: 10Hashar) [13:01:00] jouncebot: next [13:01:00] In 119 hour(s) and 58 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170508T1300) [13:01:12] (03PS1) 10Alexandros Kosiaris: Snakeoil keys for deploy-librenms [labs/private] - 10https://gerrit.wikimedia.org/r/351635 [13:02:34] (03PS2) 10Alexandros Kosiaris: Snakeoil keys for deploy-librenms [labs/private] - 10https://gerrit.wikimedia.org/r/351635 [13:03:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Snakeoil keys for deploy-librenms [labs/private] - 10https://gerrit.wikimedia.org/r/351635 (owner: 10Alexandros Kosiaris) [13:06:06] !log db1028: Increased /srv/ by 20G to clear the warning [13:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:58] (03CR) 10Muehlenhoff: [C: 032] jenkins: remove groovy init that disabled CLI [puppet] - 10https://gerrit.wikimedia.org/r/351261 (owner: 10Hashar) [13:12:59] (03PS1) 10Elukey: Add Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204) [13:14:09] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/351629 (owner: 10Gehel) [13:16:36] !log Restarting Jenkins [13:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:03] PROBLEM - Host kubernetes1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:03] RECOVERY - Host kubernetes1004 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [13:20:38] (03PS2) 10Gehel: elasticsearch - update reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/351629 [13:24:01] (03PS1) 10Jcrespo: Revert "Re-enable ContentTranslation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351638 [13:24:04] (03PS2) 10Elukey: Add a Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204) [13:24:17] (03CR) 10Jcrespo: "Hopefully this is not needed, but I will prepare it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351638 (owner: 10Jcrespo) [13:24:32] (03PS2) 10Jcrespo: Revert "Re-enable ContentTranslation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351638 [13:24:40] (03CR) 10Marostegui: [C: 031] Revert "Re-enable ContentTranslation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351638 (owner: 10Jcrespo) [13:25:13] (03CR) 10Elukey: "Forgot to ask! @Ottomata, WDYT?" 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/350542 (owner: 10Elukey) [13:28:34] (03CR) 10Jcrespo: [C: 04-2] "Do not deploy unless https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-se" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351638 (owner: 10Jcrespo) [13:30:45] (03CR) 10Alexandros Kosiaris: "I am thinking this is ready, wondering if I have missed something (aside from updating the deploy repo itself)" [puppet] - 10https://gerrit.wikimedia.org/r/350861 (https://phabricator.wikimedia.org/T129136) (owner: 10Alexandros Kosiaris) [13:31:13] RECOVERY - Check Varnish expiry mailbox lag on cp2011 is OK: OK: expiry mailbox lag is 25177 [13:31:42] (03CR) 10Volans: [C: 032] "I've verified that the request is legit." [puppet] - 10https://gerrit.wikimedia.org/r/351362 (owner: 10Anomie) [13:31:58] (03PS2) 10Volans: Add an additional SSH key for anomie [puppet] - 10https://gerrit.wikimedia.org/r/351362 (owner: 10Anomie) [13:36:53] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T163690#3231437 (10Cmjohnson) @fgiunchedi the disk has been replaced with a new one. You will probably need to add back. Return shipping information UPS 1Z A73 27E 90 8316 9978 [13:38:07] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3231438 (10Cmjohnson) The null modem adapter cable is connected to iron. [13:40:38] (03PS1) 10Muehlenhoff: Fix typo in stretch config [puppet] - 10https://gerrit.wikimedia.org/r/351640 [13:42:53] (03PS3) 10Giuseppe Lavagetto: discovery::app_routes: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351315 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:42:55] (03PS3) 10Giuseppe Lavagetto: cache::text: switch all mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351313 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:45:56] (03PS3) 10Elukey: Add a Prometheus Apache exporter to Bohrium [puppet] - 10https://gerrit.wikimedia.org/r/351636 (https://phabricator.wikimedia.org/T163204) [13:50:24] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test100[13] lost power redundancy - https://phabricator.wikimedia.org/T153248#3231443 (10Cmjohnson) Today, I used a known good psu from a decom'd R720XD. Once plugged in the PSU tripped the circuit on the phase. This is what happened... [13:53:09] (03CR) 10Volans: cache::text: switch all mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351313 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:53:17] (03CR) 10Volans: discovery::app_routes: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351315 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [13:53:24] removing -2s [13:53:30] (03CR) 10Volans: Switch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 (owner: 10Giuseppe Lavagetto) [13:55:13] RECOVERY - puppet last run on ms-be1039 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:55:37] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3231458 (10Eevans) >>! In T163292#3230784, @Joe wrote: > I just recreated the RAID arrays and rebooted the system with t... 
[13:55:51] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Joe: Decommission mw1152 - https://phabricator.wikimedia.org/T149185#3231459 (10Cmjohnson) [13:55:53] RECOVERY - HP RAID on ms-be1039 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:56:00] 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-Joe: Decommission mw1152 - https://phabricator.wikimedia.org/T149185#2744592 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [13:57:12] (03CR) 10Giuseppe Lavagetto: [C: 032] Switch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 (owner: 10Giuseppe Lavagetto) [13:57:24] <_joe_> here we go. [13:57:24] (03CR) 10jenkins-bot: Switch back to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351592 (owner: 10Giuseppe Lavagetto) [13:57:38] <_joe_> can someone change the topic [13:57:44] I 'll do that [13:58:45] ok, starting with stage 0 [13:58:49] go :) [13:58:54] !log START - Disabling puppet on selected hosts in codfw and eqiad - t00_disable_puppet (switchdc/oblivian@neodymium) [13:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:30] !log END (PASS) - Disabling puppet on selected hosts in codfw and eqiad - t00_disable_puppet (switchdc/oblivian@neodymium) [13:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:45] !log START - Reduce the TTL of all the MediaWiki read-write discovery records - t00_reduce_ttl (switchdc/oblivian@neodymium) [13:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:53] !log END (PASS) - Reduce the TTL of all the MediaWiki read-write discovery records - t00_reduce_ttl (switchdc/oblivian@neodymium) [14:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:00] godog: around? [14:00:00] volans: ping detected, please leave a message! [14:00:05] volans: yeah, I'm here [14:00:07] go for swiftrepl [14:00:29] ack [14:00:49] !log stop swiftrepl on ms-fe1005 [14:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:17] volans _joe_ ^ [14:01:20] ack [14:01:51] !log START - Stop MediaWiki jobrunners, videoscalers and cronjobs in codfw - t01_stop_maintenance (switchdc/oblivian@neodymium) [14:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:03] PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:03] PROBLEM - Check systemd state on mw2154 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:10] expected ^ [14:04:13] PROBLEM - Check systemd state on mw2158 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:13] PROBLEM - Check systemd state on mw2162 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:23] <_joe_> qyes [14:04:23] PROBLEM - Check systemd state on mw2153 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:23] PROBLEM - Check systemd state on mw2157 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:33] PROBLEM - Check systemd state on mw2156 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
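The two t00 stages above are the preparation for the flip: puppet is disabled on the involved hosts so nothing reverts half-switched state mid-flight, and the TTL of the MediaWiki read-write discovery records is lowered so the later DNS change propagates within seconds. A minimal sketch of the TTL check, assuming dnspython and the appservers-rw.discovery.wmnet record name; the helper is illustrative, not switchdc's actual code:

    # Sketch, assuming dnspython; loosely mirrors what switchdc's dnsdisc helper does.
    import dns.resolver

    def discovery_state(record="appservers-rw.discovery.wmnet",
                        authdns="eeden.wikimedia.org"):
        res = dns.resolver.Resolver()
        # Query the authoritative server directly so no recursor cache gets in the way.
        res.nameservers = [dns.resolver.resolve(authdns, "A")[0].address]
        answer = res.resolve(record, "A")
        return answer.rrset.ttl, [rr.address for rr in answer]

    # After t00_reduce_ttl this should report a TTL of ~10 seconds, as the
    # dnsdisc DEBUG line later in the log ("... 10.2.1.1 TTL 10") confirms.
    print(discovery_state())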
[14:04:43] PROBLEM - Check systemd state on mw2160 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:43] PROBLEM - Check systemd state on mw2243 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:43] PROBLEM - Check systemd state on mw2247 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:43] we should add a task in switchdc to submit a scheduled downtime for that [14:04:53] PROBLEM - Check systemd state on mw2155 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:53] PROBLEM - Check systemd state on mw2159 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:53] PROBLEM - Check systemd state on mw2161 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:53] PROBLEM - Check systemd state on mw2248 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:53] yep, too late now [14:04:53] PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:04:59] there's already a task for the next switchover [14:05:10] (03PS1) 10Cmjohnson: Removing mgmt entries for decom host gallium [dns] - 10https://gerrit.wikimedia.org/r/351643 [14:05:13] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:05:15] too late to do it, not to file, of course :-) [14:05:23] ignore dbstore1001 [14:05:31] bad backups [14:05:35] (03CR) 10Cmjohnson: [C: 032] Removing mgmt entries for decom host gallium [dns] - 10https://gerrit.wikimedia.org/r/351643 (owner: 10Cmjohnson) [14:05:36] paravoid: etherpad with stuff to do for the future switchovers? do we have one already or should I create one? [14:05:40] i will silence dbstore1001 so it doesn't bother here for the next few hours [14:05:53] akosiaris: https://etherpad.wikimedia.org/p/codfw-switchover-AprMay2017 [14:05:56] thanks [14:06:34] !log END (PASS) - Stop MediaWiki jobrunners, videoscalers and cronjobs in codfw - t01_stop_maintenance (switchdc/oblivian@neodymium) [14:06:35] (03CR) 10Hashar: "I have cleaned up the groovy script from both contint1001 and contint2001.
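The scheduled-downtime idea floated above (so the expected systemd failures from stopping the codfw jobrunners don't spam alerts) is straightforward to automate: Icinga accepts SCHEDULE_HOST_SVC_DOWNTIME external commands on its command file. A hedged sketch; the command-file path is the Debian default and the host list is illustrative:

    # Sketch: pre-schedule Icinga downtime for hosts whose alerts are expected.
    import time

    ICINGA_CMD = "/var/lib/icinga/rw/icinga.cmd"  # assumption: default Debian path

    def downtime(host, hours=2, author="switchdc", comment="DC switchover, expected"):
        now = int(time.time())
        # Standard Nagios/Icinga external command syntax:
        # SCHEDULE_HOST_SVC_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger>;<duration>;<author>;<comment>
        cmd = "[%d] SCHEDULE_HOST_SVC_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s\n" % (
            now, host, now, now + hours * 3600, hours * 3600, author, comment)
        with open(ICINGA_CMD, "w") as f:
            f.write(cmd)

    for host in ("mw2153", "mw2154", "mw2155"):  # illustrative subset
        downtime(host)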
They have Jenkins 2.46.2 and I have confirmed the Jenkins CLI" [puppet] - 10https://gerrit.wikimedia.org/r/351261 (owner: 10Hashar) [14:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:49] going for readonly [14:06:59] go [14:07:00] let's go [14:07:08] !log START - Set MediaWiki in read-only mode in codfw (db-codfw config already merged and git pulled) - t02_start_mediawiki_readonly (switchdc/oblivian@neodymium) [14:07:08] !log MediaWiki read-only period starts at: 2017-05-03 14:07:08.261300 (switchdc/oblivian@neodymium) [14:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:33] PROBLEM - HHVM jobrunner on mw2249 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.002 second response time [14:07:43] PROBLEM - HHVM jobrunner on mw2156 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.002 second response time [14:07:43] PROBLEM - HHVM jobrunner on mw2160 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:07:54] !log root@naos Synchronized wmf-config/db-codfw.php: Set MediaWiki in read-only mode in datacenter codfw (duration: 00m 45s) [14:07:54] !log END (PASS) - Set MediaWiki in read-only mode in codfw (db-codfw config already merged and git pulled) - t02_start_mediawiki_readonly (switchdc/oblivian@neodymium) [14:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:07] confirm read only on enwiki [14:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:11] (I) [14:08:20] indeed [14:08:33] 06Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3231482 (10BBlack) Yeah we should discuss our options a bit here re: minimizing ulsfo downtime, I think we have a few options for how we arrange this. There some complicating factors with the misc... [14:08:33] !log START - Set core DB masters in read-only mode in codfw, ensure all masters are read-only - t03_coredb_masters_readonly (switchdc/oblivian@neodymium) [14:08:33] RECOVERY - HHVM jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.004 second response time [14:08:34] jynus: get a screenshot of that not very nice message you were mentioning [14:08:36] !log END (PASS) - Set core DB masters in read-only mode in codfw, ensure all masters are read-only - t03_coredb_masters_readonly (switchdc/oblivian@neodymium) [14:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:43] RECOVERY - HHVM jobrunner on mw2156 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [14:08:43] RECOVERY - HHVM jobrunner on mw2160 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.002 second response time [14:08:44] akosiaris, already fixed :-) [14:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:51] !log START - Resync the redis for jobqueues in eqiad with the masters in codfw - t04_resync_redis (switchdc/oblivian@neodymium) [14:08:52] ah cool! :-D [14:08:53] 06Operations, 10ops-eqiad, 10Continuous-Integration-Infrastructure (phase-out-gallium): decom gallium (data center) - https://phabricator.wikimedia.org/T150316#3231496 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson gallium has been wiped, dns entries removed and removed from rack. racktables updated. 
[14:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:06] !log START - Wipe and warmup caches in eqiad - t04_cache_wipe (switchdc/oblivian@neodymium) [14:09:10] confirm codfw masters in RO [14:09:11] those 2 are in parallel [14:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:14] (03CR) 10Giuseppe Lavagetto: [C: 032] discovery::app_routes: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351315 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:09:26] (all the logging appears to be from oblivian but that is because of tmux ;) ) [14:09:27] (03CR) 10Giuseppe Lavagetto: [C: 032] cache::text: switch all mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/351313 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [14:09:55] I see no db errors on mediawiki logs [14:10:11] volans: quick q, we could have run switchdc from sarin with no problems, right? [14:10:15] db test script looks fast [14:10:23] and good [14:10:25] akosiaris: of course, we did on the way to codfw [14:10:36] yup, that's what I remembered hence my question [14:10:37] ok [14:10:42] edit count is 0 as expected: https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&panelId=8&fullscreen&orgId=1&from=now-1h&to=now [14:11:03] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:11:06] no x1 problems so far [14:11:10] checking kafka [14:11:13] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [14:11:13] kafka? [14:11:14] yeah [14:11:14] warmup is running now [14:11:16] (but we haven't switched) [14:11:26] <_joe_> we are running warmup and redis sync [14:11:33] PROBLEM - Check health of redis instance on 6379 on rdb1007 is CRITICAL: CRITICAL: replication_delay is 1493820687 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3861124 keys, up 2 minutes 35 seconds - replication_delay is 1493820687 [14:11:51] kafka-mirror-main-eqiad_to_analytics.service failed, nothing to worry about [14:11:55] is that ok^ [14:12:10] !log END (PASS) - Resync the redis for jobqueues in eqiad with the masters in codfw - t04_resync_redis (switchdc/oblivian@neodymium) [14:12:13] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [14:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:22] !log restart kafka-mirror-main-eqiad_to_analytics.service on kafka1012 [14:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:31] <_joe_> jynus: expected but it should recover [14:12:33] RECOVERY - Check health of redis instance on 6379 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3855775 keys, up 3 minutes 35 seconds - replication_delay is 0 [14:12:39] thanks elukey [14:12:41] heh [14:12:51] some securepoll errors [14:12:55] I assume expected [14:13:03] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational [14:13:12] It might not respect $wgReadOnly properly?
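The cache warmup running in parallel above works by replaying a set of hot requests against the still-idle eqiad appservers, so memcached and the parser caches are populated before real traffic lands. A rough sketch of the idea, not the actual warmup script; the URL list and concurrency are made up:

    # Sketch: replay hot URLs against the new primary DC to pre-populate caches.
    import concurrent.futures
    import urllib.request

    URLS = ["https://en.wikipedia.org/wiki/Main_Page"]  # really: a long list of hot pages

    def hit(url):
        req = urllib.request.Request(url, headers={"User-Agent": "warmup-sketch"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.getcode()

    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(hit, URLS))
    # Each 200 here leaves parser-cache/memcached entries behind, which is why the
    # memcached hit ratio discussed further down recovers quickly after the switch.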
[14:13:13] and some --read-only errors [14:13:21] they shouldn't happen but it is ok [14:13:22] memcached metrics in https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats?orgId=1&from=now-1h&to=now [14:13:24] I mean, how often do you have extended periods of read-only time during an election [14:13:28] oh hey RoanKattouw, thank you for being here :) [14:13:36] * RoanKattouw rubs his eyes some more [14:13:51] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1 breach: status: red, number_of_nodes: 2, unassigned_shards: 17, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 124, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 87.9432624 [14:13:55] RoanKattouw, we will file some bugs for some extensions to handle better read only [14:14:03] wikidata, securepoll, etc. [14:14:07] Yeah there's probably lots of them that don't handle it well [14:14:21] (that is why we enforce it on db side) [14:14:21] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [14:14:22] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1 breach: status: red, number_of_nodes: 2, unassigned_shards: 17, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 124, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 87.9432624 [14:14:31] <_joe_> waiting for the warmup to end [14:14:49] I think it has? [14:14:53] 06Operations, 10Cassandra, 13Patch-For-Review, 06Services (blocked): setup/install restbase-dev100[123] - https://phabricator.wikimedia.org/T151075#3231528 (10Eevans) This is done, isn't it? 
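"we enforce it on db side" above is the belt-and-braces part: besides $wgReadOnly in the MediaWiki configuration, t03 sets the MySQL masters themselves read-only, so an extension that mishandles the application-level flag still cannot write. A sketch of the per-master check, assuming PyMySQL; host and credentials are placeholders:

    # Sketch: what t03_coredb_masters_readonly amounts to, per core master.
    import pymysql  # assumption: PyMySQL is available

    conn = pymysql.connect(host="db-master.codfw.example", user="...", password="...")
    with conn.cursor() as cur:
        cur.execute("SELECT @@global.read_only")
        if cur.fetchone()[0] == 0:
            # Writes from non-SUPER users now fail, whatever MediaWiki thinks.
            cur.execute("SET GLOBAL read_only = 1")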
[14:14:56] one of two [14:14:59] the global one [14:15:05] the other is still running [14:15:06] <_joe_> paravoid: that was the redis resync [14:15:21] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:15:32] !log END (PASS) - Wipe and warmup caches in eqiad - t04_cache_wipe (switchdc/oblivian@neodymium) [14:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:46] !log START - Switch MediaWiki master datacenter and read-write discovery records from codfw to eqiad - t05_switch_datacenter (switchdc/oblivian@neodymium) [14:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:20] !log root@naos Synchronized wmf-config/CommonSettings.php: Switch MediaWiki active datacenter to eqiad (duration: 00m 31s) [14:16:22] !log END (FAIL) - Switch MediaWiki master datacenter and read-write discovery records from codfw to eqiad - t05_switch_datacenter (switchdc/oblivian@neodymium) [14:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:36] 2017-05-03 14:16:21,906 [ERROR] Expected IP '10.2.2.1', got '10.2.1.1' [14:16:38] Expected IP '10.2.2.1', got '10.2.1.1' [14:16:42] yep, checking [14:16:51] I was about to paste that too [14:16:57] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test100[13] lost power redundancy - https://phabricator.wikimedia.org/T153248#3231544 (10Eevans) >>! In T153248#3231443, @Cmjohnson wrote: > Today, I used a known good psu from a decom'd R720XD. Once plugged in the PSU tripped the circ... [14:17:13] <_joe_> that's ok [14:17:16] !log START - Switch traffic flow to the appservers from codfw to eqiad - t05_switch_traffic (switchdc/oblivian@neodymium) [14:17:17] <_joe_> it was some lag [14:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:27] we should add a small sleep before the check [14:17:46] race condition, then? [14:17:53] <_joe_> jynus: yes [14:17:56] it was only one one [14:17:57] [DEBUG dnsdisc.py:81 in resolve] eeden.wikimedia.org:appservers-rw: 10.2.1.1 TTL 10 [14:18:07] good, if it is ok, now we can continue [14:18:21] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [14:18:55] or do you want to retry? 
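The t05 END (FAIL) above is the race being discussed: the task verified the discovery record immediately after the update and still saw the cached codfw address (10.2.1.1) instead of the expected eqiad one (10.2.2.1). Since t00 already lowered the TTL to 10s, polling briefly instead of checking once is enough; a minimal sketch using the values from the log:

    # Sketch: poll the discovery name until it serves the expected IP, instead of
    # a single immediate check that can lose the race against the 10s TTL.
    import socket
    import time

    def wait_for_ip(name, expected, timeout=60, interval=2):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if socket.gethostbyname(name) == expected:
                return
            time.sleep(interval)
        raise RuntimeError("%s did not converge to %s" % (name, expected))

    wait_for_ip("appservers-rw.discovery.wmnet", "10.2.2.1")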
[14:18:55] !log END (PASS) - Switch traffic flow to the appservers from codfw to eqiad - t05_switch_traffic (switchdc/oblivian@neodymium) [14:19:01] ah, sorry [14:19:01] PROBLEM - Check health of redis instance on 6378 on rdb2003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705607 keys, up 40 days 21 hours [14:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:04] not following the log [14:19:06] (03CR) 10Thcipriani: [C: 04-1] "Should be all the changes needed in puppet aside from adding deploy_librenms and deploy_librenms.pub private and public keys to puppet sec" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350861 (https://phabricator.wikimedia.org/T129136) (owner: 10Alexandros Kosiaris) [14:19:11] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7357093 keys, up 40 days 21 hours [14:19:14] !log START - Switch the Redis masters from codfw to eqiad and invert the replication - t06_redis (switchdc/oblivian@neodymium) [14:19:15] <_joe_> redis is expected [14:19:20] !log END (PASS) - Switch the Redis masters from codfw to eqiad and invert the replication - t06_redis (switchdc/oblivian@neodymium) [14:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:21] PROBLEM - Check health of redis instance on 6379 on mc2027 is CRITICAL: CRITICAL: replication_delay is 1209222 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 663637 keys, up 40 days 5 hours - replication_delay is 1209222 [14:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:31] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1209232 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 7357426 keys, up 40 days 21 hours - replication_delay is 1209232 [14:19:38] !log START - Set core DB masters in read-write mode in eqiad, ensure masters in codfw are read-only - t07_coredb_masters_readwrite (switchdc/oblivian@neodymium) [14:19:40] eqiad slaves getting queries alerady: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&from=now-1h&to=now&refresh=30s [14:19:41] !log END (PASS) - Set core DB masters in read-write mode in eqiad, ensure masters in codfw are read-only - t07_coredb_masters_readwrite (switchdc/oblivian@neodymium) [14:19:41] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1209247 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 7265447 keys, up 40 days 21 hours - replication_delay is 1209247 [14:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:54] !log START - Set MediaWiki in read-write mode in eqiad (db-eqiad config already merged and git pulled) - t08_stop_mediawiki_readonly (switchdc/oblivian@neodymium) [14:20:01] going back to RW [14:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:05] volans: go! 
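t06 above inverts Redis replication: each eqiad instance is promoted to master and its codfw counterpart is re-pointed at it, which is why the rdb2003 replication_delay alerts recover a couple of minutes later. A sketch with redis-py; the host pair is one illustrative example of many:

    # Sketch: promote the eqiad side of one Redis pair and re-point codfw at it.
    import redis  # assumption: redis-py

    def invert_pair(eqiad_host, codfw_host, port=6379):
        eqiad = redis.StrictRedis(host=eqiad_host, port=port)
        codfw = redis.StrictRedis(host=codfw_host, port=port)
        eqiad.slaveof()                    # no args = SLAVEOF NO ONE: become master
        codfw.slaveof(eqiad_host, port)    # codfw becomes the replica

    invert_pair("rdb1007.eqiad.wmnet", "rdb2003.codfw.wmnet")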
[14:20:08] yay [14:20:11] ok, small spike on x1, but ok for now [14:20:20] (14 connections only) [14:20:21] RECOVERY - Check health of redis instance on 6379 on mc2027 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 657298 keys, up 40 days 5 hours - replication_delay is 0 [14:20:24] scap running [14:20:28] !log root@naos Synchronized wmf-config/db-eqiad.php: Set MediaWiki in read-write mode in datacenter eqiad (duration: 00m 32s) [14:20:28] !log MediaWiki read-only period ends at: 2017-05-03 14:20:28.286697 (switchdc/oblivian@neodymium) [14:20:28] !log END (PASS) - Set MediaWiki in read-write mode in eqiad (db-eqiad config already merged and git pulled) - t08_stop_mediawiki_readonly (switchdc/oblivian@neodymium) [14:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:37] <_joe_> let's confirm everything works? [14:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:53] I can edit! [14:20:55] same session of before [14:21:01] :-) [14:21:05] <_joe_> ok [14:21:08] 13 minutes [14:21:09] \o/ [14:21:11] RECOVERY - Check health of redis instance on 6378 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705607 keys, up 40 days 21 hours - replication_delay is 5 [14:21:12] nice [14:21:12] yes, I can, too [14:21:13] 13m20, new record :) [14:21:21] eswiki works too [14:21:21] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 7357349 keys, up 40 days 21 hours - replication_delay is 0 [14:21:22] !log START - Restore the TTL of all the MediaWiki read-write discovery records and cleanup confd stale files - t09_restore_ttl (switchdc/oblivian@neodymium) [14:21:27] \o/ [14:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:31] we haven't finished the work yet :-), now is the fun!!!!! [14:21:31] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 7357702 keys, up 40 days 21 hours - replication_delay is 0 [14:21:34] !log END (PASS) - Restore the TTL of all the MediaWiki read-write discovery records and cleanup confd stale files - t09_restore_ttl (switchdc/oblivian@neodymium) [14:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:41] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 7265767 keys, up 40 days 21 hours - replication_delay is 0 [14:21:48] edits are coming in: https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&orgId=1&from=now-1h&to=now [14:21:51] 06Operations, 06Performance-Team, 10Traffic: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3231583 (10BBlack) @Gilles - FYI the kernel upgrades that were blocking this are done, and we're tentatively looking at turning on BBR on May 22, so that we have a wee... [14:22:01] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379 [14:22:09] x1 connections still growing [14:22:22] jynus: ok to restart jobrunner? [14:22:32] I'll wait for your ok [14:22:32] ok to me [14:22:36] <_joe_> elukey: can you check rdb2001? 
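The "13m20" above is the measured read-only window, straight from the two !log timestamps (14:07:08 start, 14:20:28 end):

    from datetime import datetime

    start = datetime(2017, 5, 3, 14, 7, 8)
    end = datetime(2017, 5, 3, 14, 20, 28)
    print(end - start)  # 0:13:20 -- the read-only window celebrated in the log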
please [14:22:41] !log START - Start MediaWiki jobrunners, videoscalers and maintenance in eqiad - t09_start_maintenance (switchdc/oblivian@neodymium) [14:22:42] from what I can see for now [14:22:43] _joe_ ack [14:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:01] volans, ^ [14:23:01] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 8556192 keys, up 40 days 21 hours - replication_delay is 0 [14:23:13] jynus: change_tag and tag_summary tables working fine in eqiad and getting new records without breaking repl \o/ [14:23:28] ah ok, everything was up and running [14:23:30] good [14:23:31] edits on "normal level" [14:23:41] Flow is working [14:23:41] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1209483 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3874713 keys, up 40 days 22 hours - replication_delay is 1209483 [14:23:45] checking overall load [14:23:47] Still trying to confirm that notification delivery is working [14:23:49] checking 2005 [14:23:58] thanks RoanKattouw [14:23:59] only unexpected message in fatalmonitor afaics is 'Fatal error: Timeout reached waiting for an available pooled curl connection!' for elastica/cirrussearch cc dcausse gehel [14:24:00] RoanKattouw, it may have some delay [14:24:13] due to job queue (I assume notifications use the job queue) [14:24:14] Yeah I just got an email for my test notif [14:24:26] some lag in s3 [14:24:28] just gone [14:24:31] godog: probably warming up, percentiles are decreasing [14:24:35] And saw it delivered [14:24:41] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3855227 keys, up 40 days 22 hours - replication_delay is 0 [14:24:46] gooood [14:24:46] I don't think they use the job queue but they do use DeferredUpdate IIRC [14:24:47] dcausse: indeed it is trending down [14:24:48] !log END (PASS) - Start MediaWiki jobrunners, videoscalers and maintenance in eqiad - t09_start_maintenance (switchdc/oblivian@neodymium) [14:24:50] godog: you can restart swiftrepl when you want [14:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:57] I'll try another one, see if it goes faster this time [14:25:24] volans: ack [14:25:29] !log start swiftrepl on ms-fe1005 [14:25:32] jynus: updating tendril tree [14:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:38] marostegui, actually very verbose replication lag is lower than usual [14:25:40] marostegui: too [14:25:48] x1 connections up to 60 [14:25:50] thanks volans [14:25:56] !log START - Update Tendril tree to start from the core DB masters in eqiad - t09_tendril (switchdc/oblivian@neodymium) [14:26:02] OK, still a bit slow but faster now, ~15s [14:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:04] !log END (PASS) - Update Tendril tree to start from the core DB masters in eqiad - t09_tendril (switchdc/oblivian@neodymium) [14:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:16] but they are not language-related [14:26:36] last one is parsoid [14:26:38] <_joe_> the jobqueue didn't explode this time [14:26:46] nice :) [14:26:52] :-) [14:26:54] jynus: nice :) [14:26:57] <_joe_> our very refined "resync immediately before switching" worked [14:26:59] great!
db health seems ok overall, but it is too soon to say [14:27:09] !log START - Rolling restart of parsoid in codfw and eqiad - t09_restart_parsoid (switchdc/oblivian@neodymium) [14:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:18] it's long, we do it slowly ;) [14:27:25] x1 connections going down now [14:27:35] memcached hit ratio is ~0.8, which is good at this stage [14:27:43] ah the rolling restart, right [14:27:48] replication lag warnings starting to rise [14:27:52] db1055 is struggling a bit [14:28:05] I think memcached hit ratio going up and x1 conns going down are strongly related [14:28:22] only 4 db connection errors so far [14:28:31] The only reason Flow isn't a giant performance nightmare is memcached [14:28:36] RoanKattouw, actually, x1 is not that large overall [14:28:42] most of the lag is gone i think now [14:28:45] compared to s1-7 [14:28:51] I would imagine [14:29:00] Right [14:29:04] what did you see db1055 ? [14:29:08] *on, marostegui [14:29:22] it had a spike on lag, and lots of connections, but it quickly recovered [14:29:25] it is fine now [14:29:35] what shard is that? [14:29:37] s1 [14:29:50] it is rc service [14:30:04] es* doing ok [14:30:20] PROBLEM - Check Varnish expiry mailbox lag on cp2011 is CRITICAL: CRITICAL: expiry mailbox lag is 587095 [14:30:30] yeah, that is the typical thing, like api, that gets hammered on events like this [14:30:43] "doesn't work, reload or send the request again :-)" [14:31:23] it's not large because there aren't that many revisions in flow; the most is mediawikiwiki with 300k [14:31:30] on fatalmonitor (mediawiki) the only large error is an elastic search related query [14:31:40] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:31:55] RoanKattouw, see: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&from=now-1h&to=now&refresh=1m [14:32:01] yes I think we need to work on a better warm-up process for search queries [14:32:10] apergos: Yeah, we'll need to come up with a better architecture soon. It's in our annual plan for the coming year. [14:32:23] RoanKattouw, versus: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&from=1492567227791&to=1492657936071&refresh=1m&var-dc=codfw%20prometheus%2Fops&var-server=db2033 [14:32:27] gtk! [14:32:28] Oh, when you said a lot you meant 60 [14:32:41] I didn't say a lot [14:32:45] James_F: Flow is in the annual plan for next year? [14:32:45] did I? [14:32:54] paravoid: Yup. [14:33:07] "jynus> ok, small spike on x1, but ok for now" [14:33:08] Oh, no you're right, you literally said "up to 60" [14:33:17] 60 is abnormal [14:33:24] The only one I saw was "going down now" so my sleepy brain filled in the blanks [14:33:24] but not crazy like 5000 :-) [14:34:48] Happy to see that x1 didn't have any issues this time [14:34:58] or anything else really? [14:35:01] replication is looking good [14:35:05] does anyone see anything wrong anywhere? [14:35:08] Yeah, looks like everything was smooth [14:35:24] I would say to have a look at the queue [14:35:25] Famous last words.
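The ~0.8 memcached hit ratio quoted above comes from the get_hits/cmd_get counters; watching it climb back toward steady state is the cheapest way to see the warmed caches absorbing load (and, as noted, it moves inversely to the x1 connection count). A sketch of the computation against a single memcached instance; the host name is illustrative:

    # Sketch: derive the hit ratio from memcached's own counters.
    import socket

    def hit_ratio(host="mc1001.eqiad.wmnet", port=11211):
        sock = socket.create_connection((host, port), timeout=5)
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            buf += sock.recv(4096)
        stats = dict(line.split()[1:3] for line in buf.decode().splitlines()
                     if line.startswith("STAT"))
        return int(stats["get_hits"]) / float(stats["cmd_get"])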
genuinely curious, it's suspiciously quiet :) [14:35:32] while I keep looking at the queries [14:35:39] <_joe_> an editor reported that the citation generator in VE didn't work properly during the switchover [14:35:46] paravoid, especially given the deep maintenance we did to the databases [14:35:57] MediaWiki jobs not dequeued at 14.89% on icinga, I think it is expected [14:35:57] more schema changes probably than in the last year :-) [14:36:07] indeed [14:36:12] <_joe_> volans: yes [14:36:13] and so far so good! [14:36:20] heheh 5xx graph has been remarkably unchanged too, good [14:36:46] also no save errors reported [14:37:04] which means the read-only detection for wikitext worked nicely [14:37:42] <_joe_> ok [14:37:50] job queue is processing faster than enqueueing for now [14:37:50] <_joe_> this starts to sound a bit fishy [14:38:08] still waiting for parsoid, we didn't increase the rate since last time, did we? [14:38:25] rate of restarts, I mean [14:38:29] paravoid: nope, having no impact on anything else [14:38:33] jynus: watchlist also works fine in codfw, i forgot to say, no replication issues [14:38:36] I know, just annoying [14:39:10] RECOVERY - Check systemd state on mw2158 is OK: OK - running: The system is fully operational [14:39:20] RECOVERY - Check systemd state on mw2162 is OK: OK - running: The system is fully operational [14:39:20] RECOVERY - Check systemd state on mw2157 is OK: OK - running: The system is fully operational [14:39:20] RECOVERY - Check systemd state on mw2153 is OK: OK - running: The system is fully operational [14:39:20] RECOVERY - Check systemd state on mw2156 is OK: OK - running: The system is fully operational [14:39:41] RECOVERY - Check systemd state on mw2160 is OK: OK - running: The system is fully operational [14:39:50] RECOVERY - Check systemd state on mw2243 is OK: OK - running: The system is fully operational [14:39:50] RECOVERY - Check systemd state on mw2247 is OK: OK - running: The system is fully operational [14:39:50] RECOVERY - Check systemd state on mw2155 is OK: OK - running: The system is fully operational [14:39:50] RECOVERY - Check systemd state on mw2159 is OK: OK - running: The system is fully operational [14:39:50] RECOVERY - Check systemd state on mw2161 is OK: OK - running: The system is fully operational [14:39:51] RECOVERY - Check systemd state on mw2248 is OK: OK - running: The system is fully operational [14:39:51] RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational [14:40:00] RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational [14:40:00] RECOVERY - Check systemd state on mw2154 is OK: OK - running: The system is fully operational [14:40:35] dcausse: gehel: everything OK on your side? [14:40:50] paravoid: looking good... [14:41:00] I see a couple of relforge alerts [14:41:06] unrelated, but still [14:41:23] gehel: there are 2 alarms for relforge [14:41:27] <_joe_> should we switch search back too?
[14:41:27] ops :D [14:41:37] didn't read the line above [14:41:44] yeah, relforge is having some issues, but unrelated to the DC switch, I'm on it [14:41:50] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 764520.90 seconds [14:41:50] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 764386.95 seconds [14:41:51] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 764470.02 seconds [14:42:00] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 764441.08 seconds [14:42:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 764585.90 seconds [14:42:20] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 782235.10 seconds [14:42:21] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 764445.12 seconds [14:42:34] _joe_: search is already back, it follows mediawiki... [14:42:48] <_joe_> gehel: oh ok I wasn't sure [14:42:51] <_joe_> good [14:43:01] yeah, everything is back afaik [14:43:42] <_joe_> now we're just left with deciding when to switch back services [14:43:45] <_joe_> and traffic [14:43:56] !log END (PASS) - Rolling restart of parsoid in codfw and eqiad - t09_restart_parsoid (switchdc/oblivian@neodymium) [14:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:03] paravoid: finished :D [14:44:05] wasn't traffic switched yesterday? [14:44:10] yes [14:44:23] <_joe_> I meant traffic to misc/etc [14:44:28] <_joe_> is that already switched too? [14:44:41] well, depends on what you mean by "traffic" I guess! :) [14:44:51] <_joe_> bblack: varnish calls restbase in codfw or in eqiad? [14:44:59] codfw [14:45:02] <_joe_> ok [14:45:10] <_joe_> we need to switch that back [14:45:18] <_joe_> I think the original plan is tomorrow [14:45:26] the actual "traffic" part (inter-cache routing and end-user routing) switched back over the past two days, for all clusters [14:45:38] Trying to revdel content gives me: A database query error has occurred. This may indicate a bug in the software. [WQnrpgpAADsAAj14qxoAAABF] 2017-05-03 14:39:35: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" --NeilN talk to me 14:42, 3 May 2017 (UTC) [14:45:40] we don't need to switch anything back. it's still on codfw... [14:46:01] anyone know what this is about? [14:46:19] paravoid: Looking up that error now [14:46:19] <_joe_> I have no idea [14:46:25] _joe_: maybe I misunderstood your wording :) [14:46:36] everything is currently according-to-plan, including all services things [14:46:56] Error: 1176 Key 'ls_field_val' doesn't exist in table 'log_search' (10.64.16.77) [14:47:00] Ah... what? [14:47:02] jynus: ---^^ [14:47:02] <_joe_> bblack: we want varnish to contact restbase in eqiad, normally, and changeprop to go to codfw [14:47:02] Yeah what Roan said ^ [14:47:08] RoanKattouw: yeah, RainbowSprinkles just reported the same [14:47:16] services switch back (including the part that deploys changes to the back edge of the traffic infra) tomorrow [14:47:22] it hasn't happened yet [14:47:24] log_search? [14:47:28] <_joe_> bblack: ok, I said we need to schedule it [14:47:34] <_joe_> but let's talk about this later [14:47:44] RoanKattouw, which wiki? 
[14:47:45] INNER JOIN `log_search` FORCE INDEX (ls_field_val) ON ((ls_log_id=log_id)) [14:47:47] There's a FORCE INDEX on it [14:47:49] Yeah [14:47:51] enwiki [14:48:02] FORCE INDEX (PRIMARY) [14:48:05] Reported against enwiki [14:48:11] But searching for ls_field_val shows it's hitting others [14:48:22] itwiki, jawikibooks [14:48:50] Only seeing those 3 so far [14:49:26] I see [14:49:36] let me check the list of alters I did [14:49:59] OK it looks to me like volans is probably right that that may have to be FORCE INDEX(PRIMARY), judging from the index name and PRIMARY KEY (`ls_field`,`ls_value`,`ls_log_id`) [14:50:04] yes, we dropped that index [14:50:13] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3231629 (10Eevans) [14:50:22] I'll patch it [14:50:24] I can add it back, but please report it at T17441 [14:50:24] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [14:50:32] (afk for ~10m) [14:50:41] Patch is probably easier than re-adding the key everywhere [14:50:45] <_joe_> I'd prefer to fix the query if possible [14:50:45] no [14:50:51] RainbowSprinkles, it is not everywhere [14:51:01] we didn't finish the alter everywhere [14:51:03] (03PS1) 10Gehel: elasticsearch - fix awareness attribute [puppet] - 10https://gerrit.wikimedia.org/r/351650 [14:51:09] so it is only on the masters on some shards [14:51:22] it is safer to add it back [14:51:29] and then patch it [14:51:30] 06Operations, 10ChangeProp, 10ORES, 10Scoring-platform-team-Backlog, 10Traffic: Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#3231630 (10Ladsgroup) [14:51:52] unless you can drop the FORCE completely [14:52:11] Are you saying we don't reliably have that primary key on all DBs? [14:52:37] it was a unique key we converted into a PK [14:52:42] so doing a force doesn't work anymore [14:52:43] we have the key [14:52:45] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (doing): Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292#3231643 (10Eevans) The host is now ready to have the instances re-bootstrapped, but let's postpone doing so until after... [14:53:02] but it is called primary on some and ls_field_val in others [14:53:13] alternatively, we can add ignore index [14:53:18] for the transition period [14:53:30] the explain without the force on the join doesn't seem too bad at first sight [14:53:34] so what I am proposing is to duplicate the index [14:53:45] unless you can patch the IGNORE or no hint [14:54:04] !log restart of elasticsearch on relforge [14:54:06] 06Operations, 10ChangeProp, 10ORES, 10Scoring-platform-team-Backlog, 10Traffic: Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#3231644 (10Joe) Yes, my only doubt with this proposal is exactly that we want to be active/active but to being able to serve all the t...
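The quick fix converging above relies on a MySQL detail worth spelling out: USE/FORCE/IGNORE INDEX hints all raise error 1176 when the named index does not exist, so the hint cannot simply be rewritten to point at the dropped ls_field_val (nor, as the "top tip" further down jokes, can you just ignore that same index). Ignoring the one other index on log_search, ls_log_id, is valid on every host and pushes the optimizer to the remaining unique/primary key, whatever that host happens to call it. Schematically, with the SQL shown as strings and the full query abridged:

    # Before: 1176 "Key 'ls_field_val' doesn't exist" on hosts where the
    # unique key was converted to (and renamed as) the PRIMARY key.
    broken = """... INNER JOIN log_search FORCE INDEX (ls_field_val)
                ON ((ls_log_id = log_id)) ..."""

    # After (https://gerrit.wikimedia.org/r/351653): ls_log_id exists everywhere,
    # so ignoring it is always valid and steers the optimizer to the PK/unique key.
    fixed = """... INNER JOIN log_search IGNORE INDEX (ls_log_id)
               ON ((ls_log_id = log_id)) ..."""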
[14:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:50] I can patch it to whatever we want, I'm just not sure why that FORCE is there [14:54:54] There's only one other key though [14:55:01] so I guess it's not too complicated [14:55:05] if there is only one other key [14:55:10] IGNORE is the safe bet [14:55:11] (03PS1) 10Milimetric: Remove redundant Dashiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351651 [14:55:19] until we can continue "fixing production" [14:55:39] (03CR) 10Milimetric: [V: 032 C: 032] Remove redundant Dashiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351651 (owner: 10Milimetric) [14:55:54] (03CR) 10jenkins-bot: Remove redundant Dashiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351651 (owner: 10Milimetric) [14:55:58] ugh [14:56:01] milimetric: can you hold? [14:56:06] argh sorry [14:56:20] I looked everywhere for freezes [14:56:26] should I revert? [14:56:29] <_joe_> yeah revert please [14:56:30] actually [14:56:35] yes [14:56:41] <_joe_> the whole week is a deployment freeze : [14:56:51] (03PS1) 10Milimetric: Revert "Remove redundant Dashiki config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351652 [14:56:58] IGNORE INDEX => ls_log_id is the right way for a quick fix [14:57:00] milimetric: https://wikitech.wikimedia.org/wiki/Deployments#Week_of_May_1st [14:57:03] (03CR) 10Milimetric: [C: 032] "reverting one more time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351652 (owner: 10Milimetric) [14:57:03] OK, doing that now [14:57:15] we can later refine/think if it is needed, etc [14:58:05] (03Merged) 10jenkins-bot: Revert "Remove redundant Dashiki config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351652 (owner: 10Milimetric) [14:58:06] I'm sorry I thought that was over, read that page and missed it [14:58:13] (03CR) 10jenkins-bot: Revert "Remove redundant Dashiki config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351652 (owner: 10Milimetric) [14:58:14] step number 1, fix what is broken; step number 2, analysis [14:58:27] and this is totally my fault [14:58:43] milimetric: it's ok, no worries [14:59:36] the query rate is very low, so 30-40 errors in the last hour only [14:59:53] top tip: when ignoring an index, don't ignore the same index you were previously forcing [14:59:55] Good morning, Roan [15:00:07] xddd [15:00:13] he he [15:00:18] :-) [15:00:37] it is the only index change that is failing - the others where tested extensively [15:00:49] yeah, we missed that one :( [15:00:51] all, were, this apparently wasn't noticed [15:01:24] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 136, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 141, initializing [15:01:32] Please check my work: https://gerrit.wikimedia.org/r/351653 [15:01:34] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 136, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number:
100.0, active_shards: 141, initializing [15:01:51] Re testing: this query is only used for suppression which is a highly privileged feature [15:02:06] I'm going to have to find someone to test this for me because I don't have suppression rights [15:02:14] PROBLEM - Check Varnish expiry mailbox lag on cp2014 is CRITICAL: CRITICAL: expiry mailbox lag is 694161 [15:03:00] (03PS2) 10Gehel: elasticsearch - fix awareness attribute [puppet] - 10https://gerrit.wikimedia.org/r/351650 [15:03:01] if ( count( $index ) ) -> funny [15:03:28] Hmm, no wait, it's "just" revdel [15:03:30] RoanKattouw: We can just ask on the en.wiki channel, unless colleagues volunteer to help [15:03:50] Elitre: I'll need someone with the WikimediaDebug browser extension [15:03:56] I'll just try to repro on testwiki, I can give myself revdel there [15:04:02] yes, it's just selective deletion, not suppressing entire page AFAICT [15:04:37] also, it will work or not depending on the wiki and the db [15:04:42] In fact I already have it [15:04:48] Right :/ [15:04:53] summary is s3 and s6 are fully done [15:04:58] so it should fail now [15:05:00] everywhere [15:05:11] the others are only done on the masters [15:05:14] OK well I can repro pretty reliably [15:05:23] On testwiki which is s3 [15:05:26] yes [15:05:38] (eqiad) codfw is fully not done [15:06:10] I'm going to apply https://gerrit.wikimedia.org/r/#/c/351653/ on tin and sync it to mwdebug1002 [15:06:17] yes, thank you [15:06:29] RoanKattouw: on naos, active deployment server [15:06:36] ^ [15:06:37] Good point [15:06:51] ...isn't naos in codfw? [15:06:59] yes [15:07:03] marostegui: last thing left is the merge of the DNS change for DB master aliases, I'll leave it to you when you see fit [15:07:08] RoanKattouw: it is yeah, switching back to tin later today or tomorrow [15:07:11] OK cool [15:07:24] I had expected that to be part of the script but I suppose it's safer for it not to be [15:07:30] volans: yes, I think I will do it in a bit [15:07:32] I'll line up patches now for that [15:08:05] otherwise imagine we break mediawiki and the way to fix it at the same time :-) [15:08:44] RoanKattouw: hey, at least you didn't wake up for nothing :) [15:08:48] haha yup [15:08:48] RoanKattouw: seriously though, appreciate the help [15:08:56] (03PS3) 10Marostegui: wmnet: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/350824 (https://phabricator.wikimedia.org/T155099) [15:08:56] Not what I thought I was here for, but happy to help all the same [15:09:30] (03PS1) 10Filippo Giunchedi: Revert "Switch deployment CNAMEs to naos.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/351654 [15:09:42] godog: wait for that [15:09:50] godog: deployments still in progress [15:09:52] !log Live-hacked (cherry-picked) https://gerrit.wikimedia.org/r/#/c/351653/ onto naos and synced to mwdebug1002 for testing [15:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:58] paravoid: yeah just staging patches, will hold [15:10:20] I am monitoring "index doesn't exist" errors except ls_field_val [15:10:22] (03CR) 10Filippo Giunchedi: [C: 04-2] "on hold" [dns] - 10https://gerrit.wikimedia.org/r/351654 (owner: 10Filippo Giunchedi) [15:10:28] (03CR) 10Marostegui: [C: 032] wmnet: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/350824 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [15:10:29] none found since the switch [15:10:59] OK revdel is working for me on mwdebug1002, syncing the patch to all
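Testing "on mwdebug1002" as above works via the X-Wikimedia-Debug request header, which pins a request to a debug appserver running the live-hacked code; the browser extension mentioned earlier just sets that header for you. The same thing from code, as a sketch; treat the exact header value as an assumption about the routing configuration of the time:

    # Sketch: exercise the cherry-picked patch on the debug backend.
    import urllib.request

    req = urllib.request.Request(
        "https://en.wikipedia.org/wiki/Special:Log",
        headers={"X-Wikimedia-Debug": "backend=mwdebug1002.eqiad.wmnet"},
    )
    print(urllib.request.urlopen(req).getcode())  # 200, served by mwdebug1002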
backends [15:11:02] 06Operations, 10ChangeProp, 10ORES, 10Scoring-platform-team-Backlog, 10Traffic: Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#3231630 (10BBlack) To re-iterate what @Joe is saying a little differently: the point of cross-dc active/active (which is a goal for al... [15:11:41] I am creating a task specific to ls_field_val for followup [15:12:14] RECOVERY - Check Varnish expiry mailbox lag on cp2014 is OK: OK: expiry mailbox lag is 71 [15:12:31] thanks jynus [15:12:38] (03PS3) 10Alexandros Kosiaris: librenms: Introduce scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/350861 (https://phabricator.wikimedia.org/T129136) [15:13:06] !log catrope@naos Synchronized php-1.29.0-wmf.21/includes/logging/LogPager.php: Replace FORCE INDEX(ls_field_val) with IGNORE INDEX(ls_log_id) (https://gerrit.wikimedia.org/r/#/c/351653/ for T17441) (duration: 01m 14s) [15:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:13] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [15:14:20] (03PS1) 10Filippo Giunchedi: Revert "Switch deployment server to naos.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/351655 [15:14:29] marostegui, do you have handy the task for templatelinks/pagelinks ? [15:14:34] (03CR) 10Filippo Giunchedi: [C: 04-2] "on hold" [puppet] - 10https://gerrit.wikimedia.org/r/351655 (owner: 10Filippo Giunchedi) [15:16:28] jynus: give me a sec, i'll find it [15:18:10] RoanKattouw, do you want me to add you to the followup ticket (I understand you may not want to be subscribed there, you already helped a lot) [15:18:41] jynus: Patch deployed at 15:13 (see logmsgbot above), I'm tailing the logs and not seeing any new errors since. We were previously seeing a couple errors per minute but with stops and starts [15:18:42] jynus: we only have: https://phabricator.wikimedia.org/T17441 and https://phabricator.wikimedia.org/T162774 [15:18:53] I've asked the reported on enwiki to confirm that it works for them now [15:19:07] jynus: Feel free; I'm always behind on my phab email anyway :D [15:19:10] yeah, the followup is to align [15:19:25] mediawiki and WMF and to see if we have to delete completely [15:20:57] I will write an incident report [15:21:10] what is the action that that function does? [15:21:40] godog: I am done deploying and going to have breakfast etc, as far as I'm concerned you can switch deployment hosts (but probably ask paravoid too) [15:21:54] RoanKattouw: kk, thanks for your help! [15:21:54] (03PS1) 10BBlack: r::c::base - support 800G disks for new ulsfo systems [puppet] - 10https://gerrit.wikimedia.org/r/351659 (https://phabricator.wikimedia.org/T164327) [15:22:19] RoanKattouw: Did you merge + backport your change yet? [15:22:31] Looks like no [15:22:47] (I'm actually not sure if we want this to land in master, or just wmf.21) [15:23:06] RainbowSprinkles: Probably both until we fix things; don't want the train to break? [15:23:34] There's this lovely feature in make-wmf-branch that will backport fixes for us ;-) [15:23:38] RainbowSprinkles: No, thanks for reminding. I don't wanna self -+2 it though [15:23:45] (that nobody ever uses, we just like merging wmf hacks to master....) 
Sorry, my home internet crapped out for a (literal) minute [15:24:17] RainbowSprinkles: Honestly it was broken in MW core too, the schema says it should be PRIMARY [15:24:19] so does anyone know what was the high-level reported failure. Edit suppression? [15:24:30] RoanKattouw, oh really? [15:24:30] Or... no! I stand corrected [15:24:35] tables.sql still has the old index [15:24:46] jynus: "A database query error has occurred. This may indicate a bug in the software. [WQnrpgpAADsAAj14qxoAAABF] 2017-05-03 14:39:35: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"" [15:24:46] no, probably it is a DBA-only thing [15:24:56] That's what users saw at a high level [15:25:19] but doing something specifically? [15:25:26] Trying to revision delete [15:25:29] ok [15:25:32] that is what I needed [15:25:33] It was renamed *from* PRIMARY *to* ls_field_val by Tim in 2009 [15:25:33] thank you [15:25:38] really? [15:25:40] lol [15:25:48] :_( [15:25:50] jynus: I can get you a full stacktrace too if you'd like. [15:25:53] it is ok [15:26:02] jynus: Specifically, it's the query that searches the logs for revdel events related to what you're about to revdel [15:26:22] I just needed the non-very technical explanation [15:26:26] So that the revdel-er can see if that content had already been (partially) suppressed/unsuppressed and why and by whom and when [15:26:30] for user awareness [15:26:47] so not like super-duper common [15:27:08] No, it's an admin behavior [15:27:11] But practically it meant that when you began trying to revdel something, it would show an error when you should have gotten the form asking you for more details and a log reason [15:27:15] with the patch that should work with the official mediawiki release [15:27:29] We'll want to backport to REL1_29 then too [15:27:34] and with the intended next iteration [15:27:38] which is starting to be live [15:27:51] but I had to be bold [15:28:06] or we would not advance with schema changes [15:28:22] having a primary key now allows changing easily to any other schema [15:28:32] so I had to take the maintenance window to do that [15:28:39] https://github.com/wikimedia/mediawiki/commit/f0d3466268ad5f25221ef9cc3538a47159e8d66f for the record [15:29:29] RainbowSprinkles: I don't think we need this in REL1_29, the changes that jynus et al have made on the cluster are entirely unreflected in MW core AFAICT [15:29:41] yes, that is correct [15:29:48] this will eventually be in a future version [15:29:51] ls_field_val is in tables.sql and the database updaters and the situation around it has been stable since that commit I just linked in 2009 [15:29:57] but not on an old one [15:29:57] * RainbowSprinkles nods [15:30:09] In which case, we'll want to fix in master for the 1.30 cycle :) [15:30:18] Yes [15:30:31] I already created the ticket to track it [15:30:33] The good news is, ls_field_val only appears in *.sql files now, not in *.php files [15:30:41] * RainbowSprinkles wanders off to find coffee [15:30:55] paravoid, we are ok now [15:31:00] ack [15:31:23] the impact wasn't like huge, but very annoying for admins [15:31:44] seen here: https://logstash.wikimedia.org/goto/bc04cf0fd66c61783c8319dffd1b742b [15:31:53] For a small subset of admins at that. Revision deletion is *far* less common than full page deletion.
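Given the drift just described (ls_field_val in tables.sql since 2009, but PRIMARY on the already-altered production masters), the per-host state is easy to audit with SHOW INDEX. A sketch, assuming PyMySQL; connection details are placeholders:

    # Sketch: list the index names on log_search for one host to spot the drift.
    import pymysql  # assumption: PyMySQL is available

    conn = pymysql.connect(host="db1031.eqiad.wmnet", user="...", password="...",
                           db="enwiki")
    with conn.cursor() as cur:
        cur.execute("SHOW INDEX FROM log_search")
        print(sorted({row[2] for row in cur.fetchall()}))  # row[2] = Key_name
        # altered hosts: ['PRIMARY', 'ls_log_id']; unaltered: ['ls_field_val', 'ls_log_id']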
[15:32:01] It may also have broken our ability to remove problematic content from public view for half an hour [15:32:12] 50-60 failed requests within 45 minutes [15:32:15] To be honest, with all the schema changes we did this week, I am happy this is the only (so far) issue we have faced [15:32:22] 111 actually [15:32:31] Curious how many were retries by the same person :) [15:32:46] yeah, if you saw the large amount of alters done [15:32:54] this is a huge success [15:33:46] I will generate an incident report anyway [15:47:42] BTW, paravoid, this incident is unrelated to the switchover, it was caused by unrelated maintenance [15:49:39] kinda related, being able to do this kind of maintenance is one of the reasons we do the switchover in the first place :-P [15:50:11] hehe [15:50:17] actually no [15:50:27] it is only a secondary reason [15:50:38] main one is reliability readiness [15:50:46] sure, "one of" :) [15:53:03] cognate is failing a lot [15:55:19] Is Addshore around? [15:55:32] jynus: hello! [15:55:41] I am / was just about to step out [15:56:01] do you have mediawiki log access? [15:56:04] yup [15:56:13] search for cognate :-) [15:56:26] in any particular log? [15:56:27] I see inserts failing [15:56:38] for example, channel:DBQuery [15:57:08] (03PS1) 10BBlack: cache_upload: git rid of upload_domain [puppet] - 10https://gerrit.wikimedia.org/r/351663 [15:57:15] 60 failures of CognateStore::insertPage [15:57:18] in the last hour [15:57:36] Hmm, they all appear to be Lost connections or Read timeouts [15:58:05] they are db1031, x1 master [15:58:34] (03CR) 10Ottomata: [C: 031] check_hadoop_yarn_node_state: add syslog logging for CRITICAL states [puppet] - 10https://gerrit.wikimedia.org/r/347857 (owner: 10Elukey) [15:58:36] self-inflicted [15:58:41] it is not the db [15:58:45] "Waiting for table level lock" [15:58:58] table is locked for writing [15:59:01] hmmmm, okay [15:59:22] Looking at logstash it seems this has only been happening today [15:59:53] maybe from switchover time? [15:59:55] or before [16:00:24] in fact looking at logstash it seems to have stopped too [16:00:33] ok [16:00:41] I am personally not worried about that [16:00:44] https://logstash.wikimedia.org/goto/9f73a7a8279708e017c6b82995e02171 [16:00:54] but it was just a heads up [16:00:56] I'll check back on it tomorrow and investigate a bit more. [16:00:59] for you [16:01:00] Thanks for the heads up!
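The "Waiting for table level lock" state quoted above is visible directly on the database server. A sketch of one way to spot such waits, assuming a MariaDB server with a readable information_schema:

    SELECT id, user, db, time, state, LEFT(info, 100) AS query
    FROM information_schema.processlist
    WHERE state LIKE 'Waiting for table%lock'
    ORDER BY time DESC;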
[16:01:52] I am checking bad queries or other database errors [16:02:58] I think we are back to previous levels of errors [16:03:06] given the peak time [16:04:38] yeah, everything looks pretty good [16:08:02] !log install2002 - temp stop puppet to debug dhcp issue of db2084 [16:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:01] (03PS1) 10Jdlrobson: Wikivoyage should show related pages in footer of skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351664 (https://phabricator.wikimedia.org/T164391) [16:13:13] I am going to do some fine-tuning of db loads [16:15:16] (03Abandoned) 10Jcrespo: Revert "Re-enable ContentTranslation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351638 (owner: 10Jcrespo) [16:15:44] (03PS2) 10Jdlrobson: Wikivoyage should show related pages in footer of skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351664 (https://phabricator.wikimedia.org/T164391) [16:15:54] (03Abandoned) 10Marostegui: db-eqiad.php: Pool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351577 (owner: 10Marostegui) [16:17:05] (03PS3) 10Gehel: elasticsearch - fix awareness attribute [puppet] - 10https://gerrit.wikimedia.org/r/351650 [16:17:50] !log install2002 / db2084 - reverting live hack, re-enabling puppet. db2084 doesnt even talk to DHCP, all other new db servers are fine, just this one out of 22 is not. seems to be actually broken NIC, cable was switched, switch config was checked too [16:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:32] (03PS1) 10Jcrespo: mariadb: Fine-tune per-server load to reduce db connection errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351668 [16:22:12] (03CR) 10Jcrespo: [C: 032] mariadb: Fine-tune per-server load to reduce db connection errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351668 (owner: 10Jcrespo) [16:24:11] (03CR) 10Dzahn: "arsenic is more decom'ed than the other 2. it only has mgmt DNS left which is the case before the very last decom steps by dc-ops are done" [puppet] - 10https://gerrit.wikimedia.org/r/351587 (owner: 10Muehlenhoff) [16:24:55] (03PS2) 10Dzahn: Clean up netboot config [puppet] - 10https://gerrit.wikimedia.org/r/351587 (owner: 10Muehlenhoff) [16:24:57] (03Merged) 10jenkins-bot: mariadb: Fine-tune per-server load to reduce db connection errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351668 (owner: 10Jcrespo) [16:26:02] (03CR) 10jenkins-bot: mariadb: Fine-tune per-server load to reduce db connection errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351668 (owner: 10Jcrespo) [16:28:34] (03CR) 10Dzahn: "@Robh more incomplete decoms detected.. 
i'll paste your checkbox template on tickets i can find" [puppet] - 10https://gerrit.wikimedia.org/r/351587 (owner: 10Muehlenhoff) [16:30:26] !log jynus@naos Synchronized wmf-config/db-eqiad.php: Fine-tune per-server load to reduce db connection errors (duration: 01m 27s) [16:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:05] 06Operations, 10ops-eqiad: reclaim europium - https://phabricator.wikimedia.org/T153918#3232039 (10Dzahn) a:03Dzahn [16:34:08] 06Operations, 10ops-eqiad, 06Labs: decom promethium - https://phabricator.wikimedia.org/T164395#3232060 (10Dzahn) [16:35:31] 06Operations, 10ops-eqiad, 06Labs: decom promethium - https://phabricator.wikimedia.org/T164395#3232084 (10Dzahn) a:05Andrew>03Dzahn [16:35:44] PROBLEM - puppet last run on analytics1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:38:30] 06Operations: reclaim arsenic as spare - https://phabricator.wikimedia.org/T83340#3232085 (10Dzahn) [16:38:50] (03PS3) 10Dzahn: Clean up netboot config [puppet] - 10https://gerrit.wikimedia.org/r/351587 (https://phabricator.wikimedia.org/T153918) (owner: 10Muehlenhoff) [16:39:15] (03CR) 10Dzahn: [C: 032] Clean up netboot config [puppet] - 10https://gerrit.wikimedia.org/r/351587 (https://phabricator.wikimedia.org/T153918) (owner: 10Muehlenhoff) [16:40:52] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3232098 (10jcrespo) tempdb2001 is not going to be used anymore, but before returning to the pool of spares, we need to retire if from pupp... [16:41:54] 06Operations, 13Patch-For-Review: reclaim arsenic as spare - https://phabricator.wikimedia.org/T83340#3232107 (10Dzahn) 05Resolved>03Open re-opening since nowdays racktables says it's in the "decom" rack. but we still had entry in netboot and there is mgmt DNS left now. we should check for switch config to... [16:43:01] (03PS1) 10Volans: Dnsdisc: try multiple times on check_record [switchdc] - 10https://gerrit.wikimedia.org/r/351670 (https://phabricator.wikimedia.org/T164396) [16:43:36] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3232138 (10jcrespo) [16:44:00] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3232139 (10Marostegui) >>! In T161712#3232098, @jcrespo wrote: > tempdb2001 is not going to be used anymore, but before returning to the p... 
[16:44:06] 06Operations, 10ops-eqiad: decommission db1025 - https://phabricator.wikimedia.org/T164397#3232141 (10Jgreen) [16:45:04] (03PS1) 10Dzahn: final decom of arsenic (mgmt and asset tag) [dns] - 10https://gerrit.wikimedia.org/r/351671 (https://phabricator.wikimedia.org/T83340) [16:45:15] 06Operations, 10ops-eqiad: decommission lutetium - https://phabricator.wikimedia.org/T164398#3232158 (10Jgreen) [16:46:34] (03CR) 10Dzahn: "racktables says it's already in the decom rack, D6" [dns] - 10https://gerrit.wikimedia.org/r/351671 (https://phabricator.wikimedia.org/T83340) (owner: 10Dzahn) [16:47:45] (03PS1) 10Volans: t05_switch_traffic: increase verbosity [switchdc] - 10https://gerrit.wikimedia.org/r/351674 (https://phabricator.wikimedia.org/T164400) [16:48:05] 06Operations, 06Performance-Team, 10Traffic: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3232221 (10ema) [16:48:27] https://upload.wikimedia.org/wikipedia/commons/thumb/6/6a/Templeborough_Roman_Fort_visualised_3D_flythrough_-_Rotherham.webm/640px--Templeborough_Roman_Fort_visualised_3D_flythrough_-_Rotherham.webm.jpg [16:48:29] 06Operations, 13Patch-For-Review: decom arsenic: (was: reclaim arsenic as spare) - https://phabricator.wikimedia.org/T83340#3232222 (10Dzahn) [16:48:37] broken thumbnail [16:48:49] should I open a bug report? [16:49:54] Error generating thumbnail [16:49:56] There have been too many recent failed attempts (4 or more) to render this thumbnail. Please try again later. [16:50:01] 06Operations, 13Patch-For-Review: decom arsenic: (was: reclaim arsenic as spare) - https://phabricator.wikimedia.org/T83340#912592 (10Dzahn) [16:50:12] yeah sounds like some kind of rendering bug on that one? [16:50:17] yannf, yes, and add this information: https://phabricator.wikimedia.org/P5367 [16:50:54] ffmpeg isn't happy about making a thumb from that video, hmm [16:51:04] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 108, down: 0, dormant: 0, excluded: 3, unused: 0 [16:51:23] could be the same error: https://phabricator.wikimedia.org/T161465 [16:51:34] 06Operations, 10ops-eqiad, 13Patch-For-Review: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#3232241 (10Dzahn) [16:53:08] I am about to leave, could someone help him move along on phab/our tickets if he needs help [16:54:00] (03PS1) 10Dzahn: DHCP: decom europium [puppet] - 10https://gerrit.wikimedia.org/r/351676 (https://phabricator.wikimedia.org/T153918) [16:54:04] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/2: down - Cust: Airport Express WiFi APBR [16:54:38] why is there a bpo version in the Phab paste? the video scalers are all trusty [16:54:49] "ffmpeg version 3.0.2-2~bpo8+1" [16:54:51] 06Operations, 10ops-eqiad, 13Patch-For-Review: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#3232248 (10Dzahn) still here: https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1728 HW warranty expiration: 2015-08-29 [16:54:59] that is what mediawiki threw at me [16:55:05] live on the server [16:55:52] ok done https://phabricator.wikimedia.org/T164401 [16:56:00] 06Operations, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3232264 (10Jgreen) >>! In T152562#3227173, @fgiunchedi wrote: > @Jgreen any news/updates on having FR fully on jessie?
We're still waiting for the hardware install to replace the last Precise... [16:56:04] yannf, thank you very much [16:56:22] 06Operations, 10ops-eqiad, 13Patch-For-Review: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#3232277 (10Dzahn) [16:56:40] PROBLEM - MariaDB Slave IO: x1 on db1031 is CRITICAL: CRITICAL slave_io_state could not connect [16:56:46] PROBLEM - MariaDB Slave SQL: x1 on db1031 is CRITICAL: CRITICAL slave_sql_state could not connect [16:56:50] nice [16:57:07] lovely [16:57:17] too many connections [16:57:29] yep [16:57:55] can I help? [16:58:56] (03PS1) 10Dzahn: decom europium, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/351677 (https://phabricator.wikimedia.org/T153918) [16:59:01] doesn't even let me login from the alternative port [16:59:11] should we merge: https://gerrit.wikimedia.org/r/#/c/351638/ ? [16:59:12] uh oh [16:59:23] but we need to know if it was content translation first [16:59:44] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3232282 (10Dzahn) [16:59:48] https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&from=now-1h&to=now [16:59:48] or cognate, which was having some heavy locking [17:00:01] oh, yes, you mentioned cognate too [17:00:09] what is cognate btw ? [17:00:12] can that be disabled? addshore ? [17:00:14] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3232060 (10Dzahn) still here: https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1786 HW warranty expiration: 2016-01-22 [17:00:24] 170503 16:54:55 [Note] Threadpool has been blocked for 30 seconds [17:00:46] I am restarting the server [17:00:50] it is the only option now [17:00:55] jynus: restarting or killing? [17:01:06] killing [17:01:08] good [17:01:19] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. [17:01:36] how nice ... [17:01:44] yeah, see: https://grafana-admin.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&from=now-1h&to=now [17:01:50] If 'extra_port' parameter is set, you can still connect to the database with superuser account (it must be TCP connection using extra_port as TCP port) and troubleshoot the situation. [17:01:56] I thought we had that [17:01:59] we have it set [17:02:02] akosiaris: we do [17:02:02] 3307 [17:02:11] but jynus said that not even that worked [17:02:14] :-( [17:02:21] and akosiaris max conn and max threads are a bit different, we have a pool of threads [17:02:43] see /etc/my.cnf [17:02:56] yeah yeah not arguing with that [17:03:21] (03PS1) 10Dzahn: decom promethium (!??) - do not merge [puppet] - 10https://gerrit.wikimedia.org/r/351678 (https://phabricator.wikimedia.org/T164395) [17:03:25] it is cognate [17:03:26] kill that [17:03:38] <_joe_> what is cognate exactly? 
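For reference, the escape hatch quoted at [17:01:50] is MariaDB's thread-pool extra port: with thread_handling set to pool-of-threads, a separate one-thread-per-connection listener is kept open for superuser access, which is why the alternative port (3307) was tried. A sketch of how to confirm the setup from a working session; connecting would be something like mysql -h db1031 -P 3307 with a superuser account, although in this incident even that was unreachable:

    SHOW GLOBAL VARIABLES LIKE 'thread_handling';
    SHOW GLOBAL VARIABLES LIKE 'extra_port';
    SHOW GLOBAL VARIABLES LIKE 'extra_max_connections';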
[17:03:42] RECOVERY - MariaDB Slave IO: x1 on db1031 is OK: OK slave_io_state Slave_IO_Running: Yes [17:03:44] mediawiki extension [17:03:48] RECOVERY - MariaDB Slave SQL: x1 on db1031 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:03:49] nobody knows, how nice [17:03:58] addshore, [17:04:12] someone grep for cognate on settings [17:04:28] <_joe_> interwiki links [17:04:29] <_joe_> ok [17:04:58] RECOVERY - puppet last run on analytics1059 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:04:59] <_joe_> volans: no, I don't know how each single extension we have in prod works :/ [17:05:03] https://gerrit.wikimedia.org/r/#/c/346524/ [17:05:06] could that be it? [17:05:34] <_joe_> do we miss some schema change? [17:05:56] it's false by default ? [17:06:03] (03PS1) 10Jcrespo: Disable congnate- it is creating an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351679 [17:06:28] (03CR) 10Marostegui: [C: 031] Disable congnate- it is creating an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351679 (owner: 10Jcrespo) [17:06:29] just wiktionary ? and it causes all that ? [17:06:39] anyway ok, move forward with it [17:06:54] I have added addshore as a reviewer so he can see it when he is back [17:07:33] things are up now [17:07:40] so I am waiting for CI [17:07:45] <_joe_> do we know what causes those locks [17:07:49] don't wait [17:08:03] I can wait, things are up [17:08:09] (03CR) 10Jcrespo: [C: 032] Disable congnate- it is creating an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351679 (owner: 10Jcrespo) [17:08:18] <_joe_> volans: actually I'd prefer to be sure we're not just disabling it and we know which queries are the problem [17:08:37] <_joe_> and can open a UBN! bug about it [17:08:43] I have the screens [17:08:44] _joe_: agree, but jynus said he saw it was this one [17:08:48] I might have misread though [17:08:50] yes, I have proof [17:08:53] (03PS1) 10Volans: t09_start_maintenance: clear systemctl state on dc_from [switchdc] - 10https://gerrit.wikimedia.org/r/351680 (https://phabricator.wikimedia.org/T164403) [17:08:55] <_joe_> ok cool [17:08:57] not doing it at random [17:09:04] <_joe_> then just disable it [17:09:09] I commented on bad behaviour before [17:09:14] <_joe_> I am sure you know it's that extension [17:09:23] (03Merged) 10jenkins-bot: Disable congnate- it is creating an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351679 (owner: 10Jcrespo) [17:09:31] (03CR) 10jenkins-bot: Disable congnate- it is creating an outage on x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351679 (owner: 10Jcrespo) [17:09:33] <_joe_> not sure if you were able to pin it to some specific queries [17:09:43] it is not queries [17:09:49] transaction is open [17:09:53] <_joe_> oh ok [17:09:55] but not doing anything [17:10:00] <_joe_> the usual one [17:10:02] <_joe_> :P [17:10:10] <_joe_> I've seen that kind of issue so many times [17:10:20] normally the query killer takes care of that [17:10:29] but apparently x1 is special [17:10:33] <_joe_> but that depends on the amount of contention [17:10:35] <_joe_> I guess [17:10:37] yes [17:10:58] it is starting again [17:11:01] <_joe_> if you can trust the bad actors to be a few and evenly spaced, a query killer is effective [17:11:04] so deploying [17:11:12] <_joe_> jynus: yes please do [17:11:14] Is there a task for the x1 outage already?
[17:11:22] (So I can reference it on T164406 ) [17:11:22] T164406: Something weird going on with Flow in nowiki? - https://phabricator.wikimedia.org/T164406 [17:11:39] <_joe_> RoanKattouw: I guess not, we're first trying to firefight [17:11:50] OK, no rush, fight first and phab later [17:13:16] evening, nothing to do with CX I hope? I've been watching the slow query log and only been seeing cognate there since the switch [17:13:25] Nikerabbit: no :) [17:13:30] and guess what? [17:13:38] it is cognate this time :-) [17:13:58] RoanKattouw is going to be super-happy [17:14:09] jynus: is it deployed already? [17:14:12] it is on it [17:14:18] ah ok [17:14:23] (not urgent) What even is cognate? [17:14:23] !log jynus@naos Synchronized wmf-config/InitialiseSettings.php: Disable cognate- it is causing an outage on x1 (duration: 01m 06s) [17:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:53] look, problem solved [17:15:02] now it is someone else's problem :-) [17:15:05] yeah [17:15:16] connections are gone [17:15:40] _joe_, by the way, our monitoring captures show processlist [17:15:44] Hah, TIL about https://www.mediawiki.org/wiki/Extension:Cognate [17:16:09] so there is always that for future debugging [17:16:58] now there should be no more errors [17:17:24] yeah, there are no more cognate connections there [17:17:41] nice! [17:18:07] now, if things are horribly wrong, that is another story, but as an operator, those switches are what we can do under the clock [17:18:20] and why I asked for more [17:18:33] I am not writing this incident [17:18:41] I assume addshore will [17:18:59] when was that extension enabled? [17:19:07] 9 days ago I think [17:19:41] 1620 hits [17:19:54] max execution time 141 seconds [17:19:59] do the math :-) [17:20:00] yeah on Apr 24 [17:20:18] RECOVERY - Check Varnish expiry mailbox lag on cp2011 is OK: OK: expiry mailbox lag is 6815 [17:20:23] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 10 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3232383 (10Volans) [17:20:24] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#3232384 (10Volans) [17:20:26] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3232385 (10Volans) [17:20:28] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review, 07Wikimedia-Multiple-active-datacenters: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3232378 (10Volans) 05Open>03Resolved a:03Volans Resolving this after a successful MediaWiki swit... [17:20:40] but that is not the worst [17:20:48] the query killer should take care of that [17:20:58] it has, for some reason, table-level locks [17:21:11] every write was blocked on something [17:21:37] some delete without indexes or something [17:22:40] not sure whether to convert T164406 [17:22:41] T164406: Something weird going on with Flow in nowiki?
- https://phabricator.wikimedia.org/T164406 [17:22:51] or create a new one and merge it there [17:27:48] Just read all of the above, I may be able to look into it this evening, but I may also be without wifi [17:27:48] Also, as cognate is now disabled all interwiki links in the Wiktionary main namespace will now not appear [17:27:48] Odd how this only started appearing today, but will investigate asap [17:28:35] addshore, that is your doing- disabling an extension should not disable other functionality [17:29:33] Well, the extension is what provides the interwiki links [17:29:52] I thought wikibase-client did that [17:30:55] No, cognate provides the interwiki links for Wiktionaries, and wikibase-client for other projects (different use case) [17:30:55] https://phabricator.wikimedia.org/T164407 [17:31:31] But the equivalent happened; if you turned off wikibase-client, client Wikipedias would have no site links. It's not exactly avoidable if you're going to turn something like that off [17:31:50] so there were no interwikis there before cognate? [17:32:24] Yes, but they have been removed from the wikitext in most cases, as happened when wikibase first rolled out [17:32:34] as I said, your doing [17:32:36] :-) [17:33:59] !log uploaded openjdk-8 u131 to apt.wikimedia.org [17:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:55] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Make WDQS active / active - https://phabricator.wikimedia.org/T162111#3232427 (10Smalyshev) 05Open>03Resolved a:03Smalyshev I think everything is fine, we can close this? [17:37:07] (03PS2) 10Dzahn: DHCP: decom europium [puppet] - 10https://gerrit.wikimedia.org/r/351676 (https://phabricator.wikimedia.org/T153918) [17:37:28] ACKNOWLEDGEMENT - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/2: down - Cust: Airport Express WiFi APBR Ayounsi Troubleshooting codfw wifi [17:39:41] it took 2 minutes to go from 16 connections to 5000 [17:39:54] nothing we or the query killer could have done [17:40:12] jynus: do we know what query was being attempted on those connections? [17:40:18] many [17:40:40] but it wasn't so much the query as the locking [17:40:48] see the process list- it is in waiting [17:41:03] last time I saw several queries waiting for a full table lock [17:41:05] <_joe_> jynus: were the locked connections coming from the jobrunners by any chance? [17:41:08] that should not happen ever [17:41:13] wikuser [17:41:18] let me see the ips [17:42:00] in any case, queries can be seen on tendril [17:42:05] filtering by db1031 [17:42:28] Awesome, I can do that when I get to see WiFi / not mobile [17:43:21] 06Operations, 06Performance-Team: webpagetest-alerts: Difference in size authenticated - https://phabricator.wikimedia.org/T164209#3232442 (10Peter) Hmm, we only test one page as logged in right now so there's no way to separate changes for that specific page/article and the framework. We should add test two p... [17:45:29] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
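The query killer referred to throughout is, in essence, automation around a manual loop: find runaway queries, then kill them. A minimal sketch of the manual version on MariaDB; the 60-second cutoff and the thread id are illustrative values, not the production thresholds:

    -- list client queries running longer than 60 seconds
    SELECT id, user, time, LEFT(info, 100) AS query
    FROM information_schema.processlist
    WHERE command = 'Query' AND time > 60;

    -- then, for each offending thread id found above:
    KILL QUERY 12345;

As jynus notes just above, though, this is little help when connections climb from 16 to 5000 in two minutes behind a table-level lock.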
[17:45:30] I see the user names [17:45:37] <_joe_> first ip I tried, it's a jobrunner [17:45:46] oh, that could be it [17:45:58] but the queries (not connections) I am seeing [17:46:06] are all user-generated [17:46:16] <_joe_> yup, all cronjobs [17:46:26] <_joe_> which could be generated by user actions [17:46:29] RoanKattouw: yeah, isn't it a wonderfully descriptive name for an extension? :) [17:46:37] let me show you [17:46:38] So, cognate has one job, which simply triggers HTMLCacheUpdateJob [17:46:39] It actually is! [17:46:45] Now that I know what it does [17:47:01] <_joe_> for which wiki addshore ? [17:47:06] <_joe_> any? [17:47:07] (03CR) 10Dzahn: [C: 032] DHCP: decom europium [puppet] - 10https://gerrit.wikimedia.org/r/351676 (https://phabricator.wikimedia.org/T153918) (owner: 10Dzahn) [17:47:10] <_joe_> I can inspect the queue [17:47:15] All wiktionaries [17:47:21] <_joe_> ok [17:47:24] But only wiktionaries [17:48:08] https://phabricator.wikimedia.org/T164407#3232448 [17:48:20] note that the contention [17:48:33] The job code can be seen at https://github.com/wikimedia/mediawiki-extensions-Cognate/blob/master/src/CacheUpdateJob.php [17:48:34] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&from=1493828837614&to=1493832461569 [17:48:40] is "waiting" [17:48:49] so it was not a query overload [17:48:58] queries were waiting on a lock [17:49:18] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:51:30] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3232449 (10faidon) a:05Cmjohnson>03RobH So I tried the serial console and also asked @Cmjohnson to reboot it in case the BIOS appears. We got nothing back, so I guess it's fried :( According to the [[ https://a... [17:52:18] "Waiting for table level lock" [17:52:48] that is a huge, huge mess if table-level blocks are involved in any way [17:53:22] I think the job is a red herring [17:54:08] <_joe_> yeah I agree [17:54:18] <_joe_> that code doesn't really justify those connections [17:55:07] Yeah, I am still thoroughly confused, but also only looking on a phone screen... [17:55:38] _joe_: how did the job queue look? [17:56:41] Without looking any further, a set of mass deletions or creations or moves of pages in ns0 of Wiktionaries would trigger many jobs for the cache update [17:56:54] But I'll stop guessing now until I look later [17:58:02] No schema differences between db1031 and db2033 (not that I'd expect any, given how new Cognate is) [17:58:19] (03PS2) 10Dzahn: decom europium, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/351677 (https://phabricator.wikimedia.org/T153918) [17:59:25] (03CR) 10Dzahn: [C: 032] "has been shutdown since quite a while, 100% packet loss, console empty" [dns] - 10https://gerrit.wikimedia.org/r/351677 (https://phabricator.wikimedia.org/T153918) (owner: 10Dzahn) [17:59:48] I have dumped all information I had [17:59:57] jynus: thanks! [18:00:22] I really need to pause, as there are no ongoing issues- and we ops are dumb and not much help for the real rockstars that you are [18:01:06] <_joe_> addshore: does that job in any way touch the cognate db?
[18:01:19] <_joe_> the cache update job does I guess [18:01:41] <_joe_> anyways, I didn't find any large job queue item [18:02:14] clearly it is lock contention [18:02:23] how it happened is the mystery [18:02:32] <_joe_> a large number of jobs were enqueued around 17:30 UTC though [18:02:40] I have to move on but I'll put my (lack of) findings on the task [18:02:52] <_joe_> which seems to be the time the trouble happened? [18:02:53] So https://github.com/wikimedia/mediawiki-extensions-Cognate/blob/master/src/CognateHooks.php#L64 [18:03:06] <_joe_> I'll do that tomorrow, I'm too tired and I need to go afk [18:03:07] That will lead to a select [18:03:14] _joe_, but the jobs [18:03:28] can be overloaded and fail [18:03:44] they didn't, they got stuck (if it is the jobs and not user connections) [18:04:09] Thanks RoanKattouw [18:04:32] <_joe_> yeah and https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=1493817043849&to=1493834528536&var-jobType=CognateCacheUpdateJob has nothing to show for that [18:04:40] ah [18:04:55] do you know who was doing 137-second queries? [18:05:07] Mr googlebot [18:05:10] <_joe_> oh [18:05:15] <_joe_> nice! [18:05:26] so that is my bet [18:05:28] <_joe_> our old friend [18:05:37] <_joe_> that seems plausible [18:05:47] ho ho ho [18:05:55] now the question is why a GET was creating such contention [18:06:08] RECOVERY - Check Varnish expiry mailbox lag on cp2005 is OK: OK: expiry mailbox lag is 0 [18:06:25] It seems selectLinkDetailsForPage is not that fast [18:06:32] with a bit of load [18:07:13] and the crawling times fit the outage [18:08:16] 06Operations, 10ops-eqiad, 13Patch-For-Review: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#3232532 (10Dzahn) [18:09:18] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [18:09:24] 06Operations, 10ops-eqiad, 13Patch-For-Review: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#2894991 (10Dzahn) revoked puppet cert, removed prod IP, was already shut down since a while, checked all the boxes i could do. handing over for switch config and datacenter-ops s...
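Two more standard MariaDB checks that fit the "which table is locked, and by whom" question being worked through here; both are sketched from the symptoms described, not copied from the incident:

    -- tables currently open with table locks held or requested
    SHOW OPEN TABLES WHERE In_use > 0;

    -- InnoDB transactions holding or waiting on locks, oldest first
    SELECT trx_id, trx_state, trx_started, trx_query
    FROM information_schema.innodb_trx
    ORDER BY trx_started;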
[18:09:38] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [18:09:48] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [18:10:45] 06Operations, 10ops-eqiad, 13Patch-For-Review: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#3232553 (10Dzahn) a:05Dzahn>03Cmjohnson [18:10:53] 06Operations, 10ops-eqiad, 13Patch-For-Review: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#2894991 (10Dzahn) a:05Cmjohnson>03None [18:11:09] RoanKattouw, please do not expose IPs in public- logs were put under NDA for a reason [18:11:12] 06Operations, 10ops-eqiad: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#2894991 (10Dzahn) [18:11:18] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:11:34] jynus: Whooops [18:11:39] jynus: I copied that query from you though [18:11:50] Oh but that pastie is WMF-NDA [18:11:51] Ugh [18:11:51] yeah, that is why I put it under NDA [18:11:56] Damn you Phabricator and your transparent embedding [18:12:00] 06Operations, 10ops-eqiad: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#2894991 (10Dzahn) @RobH see above, wanna check for switch port? thanks [18:12:07] anyone know what's going on with those router interface alerts? [18:12:11] please delete the comment [18:12:18] and paste it again without it [18:12:56] in general, be very very careful about pasting queries [18:13:05] I just did that [18:13:09] RoanKattouw: embedding doesn't break the permissions though, still just shows it to users with the rights [18:13:10] Thanks for pointing that out [18:13:10] I do not do that without putting it under NDA just in case [18:13:27] mutante: I know, I just didn't realize the pastie was under NDA [18:13:40] Because it was embedded in a public task, and of course it just displays for me because I'm in the NDA group [18:13:47] yep [18:13:51] even sometimes, some values are sensitive [18:14:00] not only the ips [18:14:08] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 22 probes of 285 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:14:10] Yeah, I know [18:14:18] I censored the values in the revdel query earlier this morning [18:14:31] I know you know, I need to lecture you :-) [18:14:55] so something is creating blockage [18:15:02] and maybe it only shows [18:15:09] when there are a lot of SELECTs [18:16:29] it would also help if people didn't use the master for reads [18:16:44] db1029 is x1-replica [18:16:54] and it is idle at the time [18:17:07] so there is no wait to failover [18:17:10] *way [18:17:16] (automatically) [18:17:30] so hm [18:17:46] robh: maybe you know, I see the router alerts above are all telia [18:17:59] do you know how to open a report with them, or is that appropriate?
[18:19:08] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 14 probes of 285 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:19:31] I am going for real this time [18:20:59] I'm having trouble connecting to gerrit, I think it's related to Telia (last time I had it, it was telia too) [18:22:00] * apergos tries cmjohnson1 [18:23:26] apergos: paravoid would probably be the best POC for any issues with telia [18:23:33] ok [18:23:42] but he may not be around (I hope not) [18:24:00] maybe XioNoX [18:24:20] Amir1 gerrit loads for me [18:24:33] It loads but it's flip flopping [18:24:41] what tz is XioNoX? I thought ours (eu something) [18:24:59] apergos: FR I believe [18:25:03] I have problems with ssh (git clone, and stuff) [18:25:04] Amir1 did you get an error when it flip flopped? [18:25:05] UTC+2 is that? [18:25:06] yeah, 8pm then [18:25:10] ah [18:25:13] well 8:30 pm [18:25:14] i can try ssh [18:26:06] ssh works pretty good for me. [18:26:33] apergos, I'm around, what's up? [18:26:38] ah [18:26:52] so in the scrollback there's three whines from icinga: [18:26:55] An example: [18:27:00] https://www.irccloud.com/pastebin/58aakRqC/ [18:27:01] (09:09:18 μμ) icinga-wm: PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR [18:27:07] (09:09:38 μμ) icinga-wm: PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR [18:27:14] (09:09:48 μμ) icinga-wm: PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR [18:27:34] robh: ping? [18:27:49] all telia, I don't see maintenance in the calendar, do we know for sure it's them, should we open a ticket, if so can you because I have no creds for any of that? [18:27:58] PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:28:01] Amir1 i will try cognate [18:28:02] well rob's at ulsfo, and they're all ulsfo related [18:28:18] PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused [18:28:18] oh wait, duh, they're not all ulsfo-related [18:28:23] no they are not [18:28:28] 1 out of three though :-D [18:28:29] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:28:34] paladox: in the second try it worked [18:28:41] ok [18:28:48] I'm positive it's an issue in traffic [18:28:49] PROBLEM - cassandra-c service on restbase1016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [18:28:55] jynus addshore: The only place in the Cognate code base where DB_REPLICA appears is in a maintenance script [18:28:58] PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.206 and port 9042: Connection refused [18:29:05] RoanKattouw: no, there are two places [18:29:13] I made a patch but it's not going up [18:29:18] PROBLEM - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:29:19] Amir1 my network seems to be working with gerrit. Maybe it's a network issue on your end? [18:29:26] it just did [18:29:26] https://gerrit.wikimedia.org/r/351682 [18:29:36] XioNoX: what do you think? [18:29:57] Amir1 also there's the inline edit for whenever pushing over ssh doesn't work. [18:30:18] PROBLEM - cassandra-b service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [18:30:28] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:30:40] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/351682/1/src/CognateStore.php [18:30:42] apergos: possibly related to: https://www.theregister.co.uk/2017/05/02/telia_hiccups_cloudflare_falls_over/ [18:30:43] Oooh [18:30:45] getWriteConnection [18:31:24] the problems were declared fully resolved by 0853 though [18:31:27] (pacific time) [18:31:37] yeah supposedly :) [18:31:53] just saying, they may still be making changes related to whatever went wrong yesterday, and continuing to screw up [18:32:13] "We have a pretty good Ikea" :)))) [18:32:18] (03PS1) 10Cmjohnson: Removing dns entries for decom servers wmf3096 and db1057 [dns] - 10https://gerrit.wikimedia.org/r/351683 [18:33:08] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom servers wmf3096 and db1057 [dns] - 10https://gerrit.wikimedia.org/r/351683 (owner: 10Cmjohnson) [18:33:24] heh one of the reddit comments on yesterday's Telia blip: "The internet has no defense for BGP typos from guys who haven't had enough coffee. We're one big BGP screwup from the digital apocalypse. Enjoy your weds morning!"
[18:34:14] hahaha nice [18:34:33] and that's why the fundraising banners talk about "a cup of coffee" [18:34:45] !log restarting restbase1013-b [18:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:58] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#3232606 (10chasemp) [18:35:18] RECOVERY - cassandra-b service on restbase1013 is OK: OK - cassandra-b is active [18:35:23] !log restarting restbase1016-c [18:35:25] 06Operations, 06Labs, 10hardware-requests: Codfw: (1) hardware access request for labtestneutron refresh - https://phabricator.wikimedia.org/T154706#2921133 (10chasemp) @robh note respec @ 32 GB RAM [18:35:28] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [18:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:38] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [18:35:38] (03PS2) 10BBlack: cache_upload: add maps functionality [puppet] - 10https://gerrit.wikimedia.org/r/351663 [18:35:48] RECOVERY - cassandra-c service on restbase1016 is OK: OK - cassandra-c is active [18:35:51] (03PS1) 10BBlack: icinga checks: varnish mailbox tweaks [puppet] - 10https://gerrit.wikimedia.org/r/351684 [18:35:58] RECOVERY - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is OK: TCP OK - 0.036 second response time on 10.64.32.206 port 9042 [18:36:09] apergos: so the links between ulsfo and eqord as well as eqord and codfw are down (both telia), everything seems to have routed around the issue [18:36:18] RECOVERY - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-b valid until 2017-09-12 15:34:20 +0000 (expires in 131 days) [18:36:23] routed around is good for us [18:36:57] should it be reported anyways? [18:36:59] RECOVERY - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-c valid until 2017-12-13 00:15:51 +0000 (expires in 223 days) [18:37:14] apergos: yeah, I'll open a ticket with telia [18:37:18] RECOVERY - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is OK: TCP OK - 0.036 second response time on 10.64.0.34 port 9042 [18:37:20] Amir1: that method is only called on main namespace page creations, deletions and moves (that are not redirects) but could be part of the issue [18:37:27] 06Operations, 10hardware-requests: eqiad: (2) hardware access request for californium and silver (labweb1001/1002) - https://phabricator.wikimedia.org/T161752#3232612 (10chasemp) @robh note respec @ 64GB RAM [18:37:28] ok. thank you (especially in your evening) [18:37:39] they sent an email 10 min ago, fwiw [18:37:48] they did? [18:37:53] ah? [18:37:54] https://rt.wikimedia.org/Ticket/Display.html?id=11602 [18:37:58] I gotta reload my email every 5 mins I guess [18:37:58] Amir1: likewise having it read from a replica could have issues, but can't think / look at that while on a phone now [18:38:04] Our NOC has detected that circuits in our backbone link between Chicago and St Louis, USA are down. Investigations are ongoing. We will be providing you with more details shortly. [18:38:14] what's rt.wikimedia.org ?
[18:38:25] ah yeah [18:38:30] legacy bug tracker [18:38:32] XioNoX: the old ticket system pre-phabricator [18:38:34] another one [18:38:44] XioNoX: that still receives mails to maint-announce@ [18:39:00] ah, well, now I know :) [18:39:10] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Decommission db1057 - https://phabricator.wikimedia.org/T162135#3232627 (10Cmjohnson) 05Open>03Resolved Removed from rack, removed switch cfg, removed from dns. disk were removed and wiped on a different server [18:39:11] those are the circuits all right [18:39:16] and then what we do is add those mails to the google calendar [18:39:20] so all I really had to do was dick around a little more :-P [18:39:28] PROBLEM - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:39:38] but for something like this we don't add it, we add planned outages obviously [18:39:59] !log T160759: reducing tombstone threshold to 1000, restbase1013 [18:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:06] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [18:40:20] ok, well, thank you for looking at it and verifying the outage and the routing around, XioNoX [18:40:32] thanks for pinging me! [18:40:37] :-) [18:40:48] also there is planned work by Telia, on May 5th, 13th, 14th... but not the one today [18:40:58] PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.206 and port 9042: Connection refused [18:41:20] no, not today [18:41:23] I checked that first thing [18:41:28] RECOVERY - cassandra-b SSL 10.64.32.206:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-b valid until 2017-09-12 15:34:20 +0000 (expires in 131 days) [18:41:37] so I now reload that page automagically every 5 minutes :-P [18:41:58] RECOVERY - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is OK: TCP OK - 0.036 second response time on 10.64.32.206 port 9042 [18:44:14] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3232646 (10Dzahn) we did not make an RT user, but RT is still used for maint-announce mails which affect networking. we should probably create one (and/or switch... [18:44:17] (03PS1) 10Ottomata: Initial debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351685 (https://phabricator.wikimedia.org/T134503) [18:44:19] (03PS1) 10Ottomata: s/etc/druid/middleManager/etc/druid/middlemanager/ in druid-middlemanager.dirs [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351686 [18:44:21] (03PS1) 10Ottomata: Bump to 0.9.0-2 with middlemananger -> middleManager fix [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351687 (https://phabricator.wikimedia.org/T134503) [18:44:23] (03PS1) 10Ottomata: Using 'upstream' as upstream branch' [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351688 [18:44:25] (03PS1) 10Ottomata: Upstream release of 0.10.0 [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351689 [18:44:47] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3232649 (10Papaul) Shipped back the bad main board.
{F7895138} [18:45:41] (03CR) 10Ottomata: [C: 032] Initial debian packaging [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351685 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [18:45:48] (03CR) 10Ottomata: [C: 032] s/etc/druid/middleManager/etc/druid/middlemanager/ in druid-middlemanager.dirs [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351686 (owner: 10Ottomata) [18:45:57] (03CR) 10Ottomata: [V: 032 C: 032] Bump to 0.9.0-2 with middlemananger -> middleManager fix [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351687 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [18:46:06] (03CR) 10Ottomata: [V: 032 C: 032] Using 'upstream' as upstream branch' [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351688 (owner: 10Ottomata) [18:46:17] (03CR) 10Ottomata: [V: 032 C: 032] Upstream release of 0.10.0 [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351689 (owner: 10Ottomata) [18:46:18] PROBLEM - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is CRITICAL: connect to address 10.64.0.34 and port 9042: Connection refused [18:46:24] !log T160759: reducing tombstone threshold to 1000, restbase1016 [18:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:33] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [18:46:38] PROBLEM - Check systemd state on restbase1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:46:59] PROBLEM - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:47:38] PROBLEM - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:47:38] RECOVERY - Check systemd state on restbase1016 is OK: OK - running: The system is fully operational [18:47:58] RECOVERY - cassandra-c SSL 10.64.0.34:7001 on restbase1016 is OK: SSL OK - Certificate restbase1016-c valid until 2017-12-13 00:15:51 +0000 (expires in 223 days) [18:48:08] PROBLEM - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.137 and port 9042: Connection refused [18:48:16] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3232661 (10Papaul) I talked to @ayounsi, db2084 was in the Wrong VLAN so the installation is complete now running puppet and salt on the server [18:48:18] RECOVERY - cassandra-c CQL 10.64.0.34:9042 on restbase1016 is OK: TCP OK - 0.039 second response time on 10.64.0.34 port 9042 [18:48:23] (03CR) 10Dereckson: Limit FeaturedFeed on dewiki to last seven days (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664) (owner: 10BearND) [18:48:30] !log T160759: reducing tombstone threshold to 1000, restbase1014 [18:48:33] !log db2084 - signing puppet certs, salt-key, initial run [18:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:38] RECOVERY - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-c valid until 2017-09-12 15:34:31 +0000 (expires in 131 days) [18:51:08] RECOVERY - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is OK: TCP OK - 0.036 second response time on 10.64.48.137 port 9042 [18:56:33] (03CR) 10Ottomata: "Bug T164008" 
[debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/351689 (owner: 10Ottomata) [18:56:48] (03PS1) 10Ottomata: Changes needed for upgrading to Druid 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/351691 (https://phabricator.wikimedia.org/T164008) [18:59:22] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3232696 (10Papaul) [19:00:21] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: codfw rack/setup 22 DB servers - https://phabricator.wikimedia.org/T162159#3154083 (10Papaul) a:05Papaul>03Marostegui This complete. @Marostegui you can take over from here. Thanks. [19:07:09] (03PS2) 10Ottomata: Changes needed for upgrading to Druid 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/351691 (https://phabricator.wikimedia.org/T164008) [19:13:17] 06Operations, 10ops-ulsfo, 10Traffic, 13Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3232726 (10RobH) IRC update: We'll split the numbering of odd and even in the odd and even racks. so rack 1.23 has cp2021, 1.22 has cp2022 | rack | hostname | | 1.23 | cp20... [19:15:25] ACKNOWLEDGEMENT - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BRxe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]BR Ayounsi Telia outage [19:15:25] ACKNOWLEDGEMENT - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR Ayounsi Telia outage [19:15:25] ACKNOWLEDGEMENT - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]BR Ayounsi Telia outage [19:18:04] !log ppchelko@naos Started deploy [restbase/deploy@76d909f]: Blacklist a title to fix cassandra OOMs T160759 [19:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:12] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [19:18:45] ^^ sorry to break the deployment freeze, cassandra is OOMing because of that one title [19:19:14] 06Operations, 06Performance-Team: HTTP responses from app servers sometimes stall for >1s - https://phabricator.wikimedia.org/T164248#3232737 (10Krinkle) [19:25:43] !log ppchelko@naos Finished deploy [restbase/deploy@76d909f]: Blacklist a title to fix cassandra OOMs T160759 (duration: 07m 39s) [19:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:52] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [19:26:10] !log ppchelko@naos Started deploy [restbase/deploy@76d909f]: Blacklist a title to fix cassandra OOMs T160759 attempt #2 - checks timeout [19:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:49] !log ppchelko@naos Finished deploy [restbase/deploy@76d909f]: Blacklist a title to fix cassandra OOMs T160759 attempt #2 - checks timeout (duration: 01m 39s) [19:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:42] 06Operations, 06Performance-Team: HTTP responses from app servers sometimes stall for >1s - 
https://phabricator.wikimedia.org/T164248#3232800 (10Krinkle) p:05Triage>03Low [19:30:58] !log demon@naos Pruned MediaWiki: 1.29.0-wmf.20 [keeping static files] (duration: 00m 44s) [19:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:22] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#3232823 (10Dzahn) [19:31:25] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3232821 (10Dzahn) 05Open>03stalled stalling, should have confirmation from @Andrew if this can really be fully decom'ed., because of this: https://gerrit.wikimedia.org/r/#/c/351678/1/hie... [19:32:32] !log demon@naos Pruned MediaWiki: 1.29.0-wmf.18 (duration: 00m 19s) [19:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:35] !log demon@naos Started scap: Cleaning up some unused branches, no-op [19:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:53] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3232826 (10Gilles) p:05Triage>03Low [19:34:17] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3224448 (10Gilles) a:03aaron [19:36:52] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3232843 (10Dzahn) 05Resolved>03Open [19:37:33] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10Dzahn) [19:37:39] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 05Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3232851 (10Krinkle) [19:37:47] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 10 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3232850 (10Krinkle) [19:37:49] 06Operations, 06Performance-Team, 06Services, 05Multiple-active-datacenters: Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#3232856 (10Krinkle) [19:37:52] 06Operations, 10DBA, 06Performance-Team, 05Multiple-active-datacenters: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#3232857 (10Krinkle) [19:37:56] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 05Multiple-active-datacenters, and 6 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#3232855 (10Krinkle) [19:38:00] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10Dzahn) p:05Triage>03Low [19:38:01] 06Operations, 10Traffic, 05Multiple-active-datacenters: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#3232863 (10Krinkle) [19:38:08] 06Operations, 10Ops-Access-Requests, 10Traffic, 
13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10Dzahn) a:05BBlack>03None [19:40:32] (03CR) 10Dzahn: decom promethium (!??) - do not merge (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/351678 (https://phabricator.wikimedia.org/T164395) (owner: 10Dzahn) [19:44:40] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:30] PROBLEM - HHVM jobrunner on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:50] (03PS1) 10Dzahn: systemd: use logrotate::conf for logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/351703 [19:46:21] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 05Multiple-active-datacenters, 13Patch-For-Review: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3232911 (10Krinkle) [19:46:24] 06Operations, 10DBA, 05DC-Switchover-Prep-Q3-2016-17, 05Multiple-active-datacenters, 13Patch-For-Review: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3232910 (10Krinkle) [19:46:26] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 05Multiple-active-datacenters, 06Services (watching), 15User-mobrovac: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#3232913 (10Krinkle) [19:46:28] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 05Multiple-active-datacenters, 13Patch-For-Review: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3232915 (10Krinkle) [19:46:32] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 05Multiple-active-datacenters, and 6 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#3232914 (10Krinkle) [19:46:34] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 05Multiple-active-datacenters, and 2 others: Create an etcd cluster in codfw - https://phabricator.wikimedia.org/T156009#3232920 (10Krinkle) [19:46:41] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, and 6 others: Expand conftool to support multiple objects via a schema definition. - https://phabricator.wikimedia.org/T155823#3232926 (10Krinkle) [19:46:41] PROBLEM - Nginx local proxy to apache on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:46] 06Operations, 05Multiple-active-datacenters, 13Patch-For-Review, 07Performance, and 2 others: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317#3232929 (10Krinkle) [19:46:48] 06Operations, 10media-storage, 05Multiple-active-datacenters, 13Patch-For-Review: Enable HTTPS for Swift traffic - https://phabricator.wikimedia.org/T127455#3232931 (10Krinkle) [19:47:00] PROBLEM - HHVM rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:20] PROBLEM - Apache HTTP on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:25] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3232961 (10Andrew) I may be confused about naming but believe that @ssastry is still actively using this system. Maybe I made some mistake at some point that made it appear decom'd? 
[19:47:43] (03CR) 10jerkins-bot: [V: 04-1] systemd: use logrotate::conf for logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/351703 (owner: 10Dzahn) [19:47:51] (03PS2) 10Dzahn: systemd: use logrotate::conf for logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/351703 [19:48:42] I assume the jobrunner issues are new? (mw1167, 68) [19:48:44] (03CR) 10jerkins-bot: [V: 04-1] systemd: use logrotate::conf for logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/351703 (owner: 10Dzahn) [19:48:48] !log demon@naos Finished scap: Cleaning up some unused branches, no-op (duration: 15m 13s) [19:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:07] I don't really have enough brain cells left for this at 11 pm, anyone who can take a look? [19:49:31] apergos: mw1287? ok, yea [19:49:43] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3232966 (10ssastry) Yes, please do not decommission. It is being used as part of https://www.mediawiki.org/wiki/Parsing/Visual_Diff_Testing. Thanks! [19:49:47] 1167, 1168 [19:50:01] subbu: can you comment on https://phabricator.wikimedia.org/T164395? I haven't had time to read the bug top to bottom yet but I think you're the one who knows things :) [19:50:01] HHVM jobrunner on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:11] i already did. :) [19:50:33] and I guess there is the mw1287 nginx local proxy but I was noticing the job runners first [19:50:53] thanks mu tante [19:51:00] mw1167 is running and looks ok at first glance [19:51:38] jobrunner processes in the list [19:53:11] mw1287 is also up and running, an API server... all of those above are socket timeouts from icinga point of view. feels like maybe on that side and hoping that is just one flap [19:53:38] icinga-wm: try again :p [19:54:37] ok hope so [19:54:41] thanks for looking [19:54:43] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3232981 (10Andrew) This box is weird since it's in the labs VM subnet despite being bare-metal. It's a weird one-off host and we have long-term plans to get rid of it but those are blocked o... [19:54:44] subbu: so I see, thank you [19:55:05] 1168 already recovered. mw1167 rescheduled next check for now [19:56:31] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3232987 (10ssastry) >>! In T164395#3232981, @Andrew wrote: > This box is weird since it's in the labs VM subnet despite being bare-metal. It's a weird one-off host and we have long-term plan... [19:56:33] !log mw1287 - restarted crashed apache (proxy_fcgi:error) [19:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:50] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [19:57:16] !log mw1287 - also restarting hhvm (with systemctl restart) [19:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:54] 06Operations, 10DBA, 07Availability (Multiple-active-datacenters), 05DC-Switchover-Prep-Q3-2016-17, 13Patch-For-Review: Decouple Mariadb semi-sync replication from $::mw_primary - https://phabricator.wikimedia.org/T161007#3232994 (10Krinkle) [19:57:58] 06Operations, 07Availability (Multiple-active-datacenters), 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178#3232995 (10Krinkle) [19:58:12] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 10 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3232997 (10Krinkle) [19:58:13] 06Operations, 07Availability (Multiple-active-datacenters), 05DC-Switchover-Prep-Q3-2016-17, 06Services (watching), 15User-mobrovac: Assess SCB@CODFW preparedness for the DC switchover - https://phabricator.wikimedia.org/T156361#3233000 (10Krinkle) [19:58:15] 06Operations, 06Performance-Team, 07Availability (Multiple-active-datacenters), 05DC-Switchover-Prep-Q3-2016-17, and 3 others: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3232999 (10Krinkle) [19:58:17] 06Operations, 07Availability (Multiple-active-datacenters), 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 13Patch-For-Review: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3233002 (10Krinkle) [19:58:20] 06Operations, 07Availability (Multiple-active-datacenters), 05DC-Switchover-Prep-Q3-2016-17, 07Epic, and 2 others: Create an etcd cluster in codfw - https://phabricator.wikimedia.org/T156009#3233003 (10Krinkle) [19:58:21] 06Operations, 07Availability (Multiple-active-datacenters), 05DC-Switchover-Prep-Q3-2016-17, 07Epic: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3233005 (10Krinkle) [19:58:26] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 07Availability (Multiple-active-datacenters), and 6 others: Expand conftool to support multiple objects via a schema definition. 
- https://phabricator.wikimedia.org/T155823#3233004 (10Krinkle) [19:58:26] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 07Availability (Multiple-active-datacenters), and 6 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#3233001 (10Krinkle) [19:58:38] 06Operations, 10DBA, 06Performance-Team, 07Availability (Multiple-active-datacenters): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#3233013 (10Krinkle) [19:58:40] 06Operations, 06Performance-Team, 06Services, 07Availability (Multiple-active-datacenters): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#3233012 (10Krinkle) [19:58:44] 06Operations, 07Availability (Multiple-active-datacenters), 13Patch-For-Review, 07Performance, and 2 others: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317#3233014 (10Krinkle) [19:58:48] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 07Availability (Multiple-active-datacenters), and 6 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#3233010 (10Krinkle) [19:58:50] 06Operations, 10media-storage, 07Availability (Multiple-active-datacenters), 13Patch-For-Review: Enable HTTPS for Swift traffic - https://phabricator.wikimedia.org/T127455#3233016 (10Krinkle) [19:58:50] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 72852 bytes in 0.355 second response time [19:59:00] 06Operations, 10Traffic, 07Availability (Multiple-active-datacenters): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#3233024 (10Krinkle) [19:59:12] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.094 second response time [19:59:20] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89946.20 seconds [19:59:30] RECOVERY - Nginx local proxy to apache on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.169 second response time [20:02:16] (03PS1) 10Ema: BBR Congestion Control [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) [20:04:32] (03CR) 10jerkins-bot: [V: 04-1] BBR Congestion Control [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema) [20:05:59] (03PS2) 10Ema: BBR Congestion Control [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) [20:07:42] (03CR) 10jerkins-bot: [V: 04-1] BBR Congestion Control [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema) [20:10:56] (03PS3) 10Ema: BBR Congestion Control [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) [20:11:35] jerkins you suck [20:12:38] * urandom gasps [20:13:19] !log T160759: restoring default tombstone thresholds, restbase10{3,4,6} [20:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:29] T160759: Cassandra OOMs - https://phabricator.wikimedia.org/T160759 [20:18:58] (03PS4) 10Ema: BBR Congestion Control [puppet] - 10https://gerrit.wikimedia.org/r/351707 (https://phabricator.wikimedia.org/T147569) [20:21:19] 06Operations: Allow Qualtrics to send @wikimedia.org emails using an SPF record or an SMTP relay - 
https://phabricator.wikimedia.org/T164424#3233141 (10Neil_P._Quinn_WMF) [20:24:50] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:25:49] 06Operations: Allow Qualtrics to send @wikimedia.org emails using an SPF record or an SMTP relay - https://phabricator.wikimedia.org/T164424#3233156 (10Neil_P._Quinn_WMF) [20:27:00] (03PS1) 10RobH: setting up dns for cp4021-cp4032 [dns] - 10https://gerrit.wikimedia.org/r/351711 [20:28:23] (03CR) 10RobH: [C: 032] setting up dns for cp4021-cp4032 [dns] - 10https://gerrit.wikimedia.org/r/351711 (owner: 10RobH) [20:30:06] !log mw1166 - restart hhvm service (Fatal error: request has exceeded memory limit) [20:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:20] RECOVERY - HHVM jobrunner on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.077 second response time [20:32:33] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 10 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3233167 (10Krinkle) >>! In T156924#3229983, @gerritbot wrote: > Change 351539 merged by jen... [20:35:31] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 203 bytes in 0.079 second response time [20:35:53] !log mw1167 - same as mw1166 (jobrunners) - there was a hhvm[12547]: Fatal error: unknown exception followed by mysql slow query, SELECT MASTER_TID_WAIT... | systemctl restart hhvm recovers it [20:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:55] (03CR) 10Dzahn: "13:14 < ema> OMG jenkins and the stupid arrows" [puppet] - 10https://gerrit.wikimedia.org/r/351225 (owner: 10Dzahn) [20:50:04] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3233353 (10Dzahn) @Muehlenhoff fyi about this special case that appeared in https://gerrit.wikimedia.org/r/#/c/351587/ [20:51:15] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3233356 (10Dzahn) So it's normal that this machine does not appear in site.pp and isn't in puppet? [20:51:46] (03Abandoned) 10Dzahn: decom promethium (!??) - do not merge [puppet] - 10https://gerrit.wikimedia.org/r/351678 (https://phabricator.wikimedia.org/T164395) (owner: 10Dzahn) [20:52:06] (03CR) 10Chad: [C: 031] gerrit: move hiera lookup to profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/351564 (owner: 10Dzahn) [20:52:12] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#3233359 (10Dzahn) [20:52:14] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3233358 (10Dzahn) 05stalled>03Invalid [20:53:44] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 10 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3233363 (10tstarling) We tried using it in production yesterday, and tested connection refu... 
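A note on Ema's BBR change earlier above (gerrit 351707, T147569): enabling BBR on Linux boils down to two kernel settings, which in ops/puppet style might be expressed roughly as below. Only the sysctl keys themselves are standard (BBR ships with kernel 4.9+ and is usually paired with the fq qdisc); the sysctl::parameters define and the resource title are assumptions, not the patch's actual contents:

    # Sketch, not the actual content of change 351707.
    sysctl::parameters { 'tcp-bbr':
        values => {
            # fq is the packet scheduler recommended alongside BBR
            'net.core.default_qdisc'          => 'fq',
            # switch TCP congestion control (e.g. from cubic) to bbr
            'net.ipv4.tcp_congestion_control' => 'bbr',
        },
    }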
[20:54:54] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T164395 says that we should keep promethium, but not sure if we have to add it back, i don't think it's " [puppet] - 10https://gerrit.wikimedia.org/r/351587 (https://phabricator.wikimedia.org/T153918) (owner: 10Muehlenhoff) [20:58:45] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: setup promethium in eqiad in support of T95185 - https://phabricator.wikimedia.org/T120262#3233370 (10Dzahn) [20:58:48] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: decom promethium - https://phabricator.wikimedia.org/T164395#3233368 (10Dzahn) 05Invalid>03Open let me open that again for a second for follow-ups, but no worries, the decom process is stopped [20:58:58] 06Operations, 10ops-eqiad, 06Labs, 13Patch-For-Review: (don't) decom promethium - https://phabricator.wikimedia.org/T164395#3233371 (10Dzahn) [20:59:51] 06Operations, 06MediaWiki-Platform-Team, 06Performance-Team, 07Availability (Multiple-active-datacenters), and 7 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3233373 (10Krinkle) [21:00:02] 06Operations, 06MediaWiki-Platform-Team, 06Performance-Team, 07Availability (Multiple-active-datacenters), and 6 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#2991064 (10Krinkle) [21:01:00] (03PS3) 10Dzahn: systemd: use logrotate::conf for logrotate config [puppet] - 10https://gerrit.wikimedia.org/r/351703 [21:03:29] (03CR) 10Dzahn: [C: 04-1] "Error: join(): Requires array to work with" [puppet] - 10https://gerrit.wikimedia.org/r/351564 (owner: 10Dzahn) [21:06:00] !log demon@naos Synchronized README: No-op, forcing co-master sync (duration: 02m 28s) [21:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:48] (03PS3) 10Dzahn: gerrit: move hiera lookup to profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/351564 [21:09:06] (03PS4) 10Dzahn: gerrit: move hiera lookup to profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/351564 [21:11:19] (03CR) 10Dzahn: [C: 031] "now it works. it would be 100% noop except i changed a resource name http://puppet-compiler.wmflabs.org/6288/" [puppet] - 10https://gerrit.wikimedia.org/r/351564 (owner: 10Dzahn) [21:11:31] (03PS5) 10Dzahn: gerrit: move hiera lookup to profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/351564 [21:12:45] (03CR) 10Dzahn: [C: 032] gerrit: move hiera lookup to profile parameter [puppet] - 10https://gerrit.wikimedia.org/r/351564 (owner: 10Dzahn) [21:37:56] 06Operations, 10ops-ulsfo, 10Traffic, 13Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233451 (10RobH) cp4021-cp4032 have been racked, but ONLY cp4021 is accessible to mgmt and network. There isn't enough power overhead in the racks to wire up all the new systems... [21:37:59] (03CR) 10Paladox: "I didn't think this will work until we upgrade gerrit or will it? This is using the replication plugin in gerrit so it will be using the ssh" [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [21:40:33] (03CR) 10Paladox: "gerrit 2.13 does not support ecdsa keys. Gerrit 2.14+ supports ecdsa keys."
[puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [21:41:25] (03CR) 10Paladox: [C: 04-1] gerrit: use new ecdsa key for replication, add pub key [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [21:43:12] (03CR) 10Paladox: [C: 04-1] "This https://gerrit-review.googlesource.com/#/c/100998/ change added support for it and was required by https://gerrit-review.googleso" [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [21:44:11] (03CR) 10Paladox: [C: 031] "Won't affect labs, or at least I hope not :)" [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [21:47:06] (03CR) 10Dzahn: [C: 04-1] "thanks paladox for pointing that out and also adding the details where support was added!" [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [21:47:24] (03CR) 10Paladox: [C: 04-1] "You're welcome :)" [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [21:52:29] Dereckson: BTW, do you know why site-requests no longer has a "special handling" column? There's a bunch of pretty serious changes (e.g. children of T159895) that I can't easily cookie-lick any more… [21:52:30] T159895: Support wikis in converting reference lists over to `responsive` - https://phabricator.wikimedia.org/T159895 [21:53:16] James_F: it has been renamed to "external" to be consistent with #dba etc. [21:53:39] Dereckson: … wait, seriously? That's very confusing. [21:53:53] "External" means pretty much the precise opposite of "special handling". [21:54:06] Was there a discussion where I can see the justification? [21:54:25] James_F: we can rename it back if you wish, but it conveys the meaning "it's managed by a team external to the site requests project" [21:54:43] But that's not what "external" means. [21:54:57] "External" means "We need to work on this, but we're blocked on someone else doing something first". [21:55:15] I'm happy to go with whatever the consensus is. [21:55:24] I just missed any discussion/e-mail about it. [21:57:39] I'll run a discussion about the columns; by the way, would you have a suggestion to express what kind of "special" handling it is? [22:00:52] "Tasks that need particularly careful liaison between teams, communities, or both" or something like that? [22:01:10] "high touch" :) [22:01:39] "Please for the love of god don't touch this unless you're Dereckson or James_F." might be a bit too blunt.
;-) [22:01:55] "Here be dragons" :) [22:04:14] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/6289/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:07:39] (03CR) 10Chad: [C: 031] gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:07:48] (03PS2) 10Dzahn: gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) [22:08:37] (03CR) 10Dzahn: [C: 032] gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:09:19] (03CR) 10Chad: [C: 031] "Actually, this is wrong. We want the public key (that's currently in authorized_keys) to be done here. Not the private key. That stays whe" [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:09:59] (03CR) 10Chad: [C: 04-1] gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:18:23] (03PS3) 10Dzahn: gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) [22:20:55] (03PS4) 10Dzahn: gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) [22:21:20] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/6290/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:22:48] (03CR) 10Chad: [C: 031] gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:23:03] (03CR) 10Dzahn: [C: 032] gerrit: use ssh::userkey to install ssh key in proper location [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:26:24] (03CR) 10Dzahn: "[cobalt:~] $ sudo -u gerrit2 ssh gerrit2@gerrit2001.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:27:50] (03CR) 10Dzahn: "ssh works from cobalt -> gerrit2001 :) (but not other way around?)" [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:33:57] (03CR) 10Dzahn: "both directions work, was wrong user name :)" [puppet] - 10https://gerrit.wikimedia.org/r/351565 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:36:15] (03CR) 10Hashar: "I tend to like the arrows alignment, that makes it slightly more friendly to the eye." [puppet] - 10https://gerrit.wikimedia.org/r/351225 (owner: 10Dzahn) [22:38:32] (03CR) 10Chad: [C: 04-1] "I think we should hold off on this for now. We know the RSA keys work, so let's stick with that. 
We can revisit a key rotation (ecdsa or e" [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:42:51] (03Abandoned) 10Dzahn: gerrit: use new ecdsa key for replication, add pub key [puppet] - 10https://gerrit.wikimedia.org/r/351566 (https://phabricator.wikimedia.org/T152525) (owner: 10Dzahn) [22:46:41] (03PS1) 10Chad: Gerrit: Start replicating to slaves [puppet] - 10https://gerrit.wikimedia.org/r/351734 [22:48:33] (03PS2) 10Dzahn: Gerrit: Start replicating to slaves [puppet] - 10https://gerrit.wikimedia.org/r/351734 (https://phabricator.wikimedia.org/T152525) (owner: 10Chad) [22:48:58] (03CR) 10Dzahn: [C: 032] Gerrit: Start replicating to slaves [puppet] - 10https://gerrit.wikimedia.org/r/351734 (https://phabricator.wikimedia.org/T152525) (owner: 10Chad) [22:55:08] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3233723 (10BBlack) I've re-run the binning analysis, with a few minor changes: 1. Fresher data (61 days ending 2017-05-02) 2. Included maps.wikimedia.org data a... [22:56:29] (03PS1) 10Chad: Gerrit: Disable custom log4j for awhile [puppet] - 10https://gerrit.wikimedia.org/r/351735 [22:57:47] (03CR) 10Dzahn: [C: 032] Gerrit: Disable custom log4j for awhile [puppet] - 10https://gerrit.wikimedia.org/r/351735 (owner: 10Chad) [22:59:41] !log gerrit: Quick restart to pick up logging config change [22:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:59] (03PS1) 10Chad: Gerrit: Disable replication to gerrit2001 for now [puppet] - 10https://gerrit.wikimedia.org/r/351737 [23:15:54] 06Operations, 10Traffic, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3233742 (10BBlack) Reformatting this a bit for comparison, and using the "new" binning (which splits 0-1K from 1K-16K): **Storage Size Percentages** (how much... 
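To recap the ssh::userkey change merged earlier (gerrit 351565): per Chad's review comment, the define has to publish the public half of the replication key on the receiving host, while the private key stays in the Gerrit user's home directory on the sender. A hedged sketch under those assumptions; the parameter names and the secret() path are illustrative, not the merged change:

    # Sketch only: install the PUBLIC replication key for the gerrit2 user
    # (e.g. under /etc/ssh/userkeys/) so the target's sshd accepts logins.
    # The private key is deliberately NOT managed here.
    ssh::userkey { 'gerrit2':
        ensure  => present,
        content => secret('gerrit/id_rsa.pub'),  # hypothetical secret path
    }

The usual rationale for a root-owned userkeys path rather than ~/.ssh/authorized_keys is that the key stays puppet-managed and the account itself cannot quietly add keys.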
[23:19:29] (03CR) 10Dzahn: [C: 032] Gerrit: Disable replication to gerrit2001 for now [puppet] - 10https://gerrit.wikimedia.org/r/351737 (owner: 10Chad) [23:35:53] (03PS1) 10BBlack: netboot cleanup for cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/351739 [23:35:55] (03PS1) 10BBlack: add cp4021 dhcp and temporary site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/351740 (https://phabricator.wikimedia.org/T164327) [23:36:56] (03CR) 10jerkins-bot: [V: 04-1] add cp4021 dhcp and temporary site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/351740 (https://phabricator.wikimedia.org/T164327) (owner: 10BBlack) [23:38:07] (03PS2) 10BBlack: add cp4021 dhcp and temporary site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/351740 (https://phabricator.wikimedia.org/T164327) [23:42:34] (03CR) 10BBlack: [C: 032] netboot cleanup for cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/351739 (owner: 10BBlack) [23:42:47] (03CR) 10BBlack: [C: 032] add cp4021 dhcp and temporary site.pp entry [puppet] - 10https://gerrit.wikimedia.org/r/351740 (https://phabricator.wikimedia.org/T164327) (owner: 10BBlack) [23:42:57] (03PS2) 10BBlack: r::c::base - support 800G disks for new ulsfo systems [puppet] - 10https://gerrit.wikimedia.org/r/351659 (https://phabricator.wikimedia.org/T164327) [23:43:19] (03CR) 10BBlack: [V: 032 C: 032] r::c::base - support 800G disks for new ulsfo systems [puppet] - 10https://gerrit.wikimedia.org/r/351659 (https://phabricator.wikimedia.org/T164327) (owner: 10BBlack) [23:46:01] wow the first "5" in host names appears, historic :) [23:46:11] afk,bbl
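Postscript on the cp4021 bring-up above (gerrit 351740): a "temporary site.pp entry" for a freshly racked box is typically just a minimal node block that keeps puppet runs green until the real cache role lands. A sketch along those lines; the role name is a guess, not the actual patch:

    # Sketch of a temporary site.pp stanza for a newly racked cache node.
    # role(spare::system) is assumed here; change 351740 may differ.
    node 'cp4021.ulsfo.wmnet' {
        role(spare::system)
    }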