[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T0000). Please do the needful. [00:03:51] !log maxsem@tin Finished deploy [kartotherian/deploy@9401f38]: Try https://gerrit.wikimedia.org/r/#/c/352886/ and https://gerrit.wikimedia.org/r/#/c/353184/ on test hosts (duration: 145m 42s) [00:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:52] !log sending arabic election emails via terbium [00:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:18] !log sending bg and bn election emails via terbium [00:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:25] !log sending german election emails via terbium [00:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:55] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [00:45:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [00:46:05] PROBLEM - swift-account-server on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:05] PROBLEM - swift-container-replicator on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:05] PROBLEM - swift-account-auditor on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:05] PROBLEM - swift-object-replicator on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:05] PROBLEM - swift-container-auditor on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:46:05] PROBLEM - swift-object-server on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:15] PROBLEM - swift-account-reaper on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:46:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [00:46:55] RECOVERY - swift-account-server on ms-be2001 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [00:46:55] RECOVERY - swift-container-replicator on ms-be2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [00:46:55] RECOVERY - swift-object-replicator on ms-be2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [00:46:55] RECOVERY - swift-account-auditor on ms-be2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [00:46:55] RECOVERY - swift-container-auditor on ms-be2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:46:55] RECOVERY - swift-object-server on ms-be2001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:47:05] RECOVERY - swift-account-reaper on ms-be2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [00:48:55] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [00:50:15] !log sending Spanish election emails via terbium [00:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:05] PROBLEM - swift-container-updater on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
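The PROCS checks above assert a count of processes whose full command line matches an anchored regex (on the real hosts this is NRPE driving a check_procs-style plugin; the exact plugin invocation is not shown in the log). A minimal self-contained sketch of that matching, with an invented process table:

```python
import re

# Sketch of what "PROCS OK: N processes with regex args ^/usr/bin/python ..."
# asserts: count processes whose command line matches an anchored regex.
# The process table below is invented so the example is deterministic;
# it is NOT output from the real ms-be hosts.
cmdlines = [
    "/usr/bin/python /usr/bin/swift-object-replicator",
    "/usr/bin/python /usr/bin/swift-account-reaper",
    "/bin/bash /usr/local/bin/backup.sh",
]

def count_procs(pattern, procs):
    """Count entries whose command line matches the (anchored) regex."""
    return sum(1 for cmd in procs if re.search(pattern, cmd))

n = count_procs(r"^/usr/bin/python /usr/bin/swift-object-replicator", cmdlines)
print(f"PROCS OK: {n} process with regex args")  # n == 1 for this table
```

The anchoring (`^`) is what keeps the check from matching unrelated processes that merely mention the daemon name somewhere in their arguments.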
[00:51:15] PROBLEM - swift-object-server on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:51:15] PROBLEM - swift-object-auditor on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:51:55] RECOVERY - swift-container-updater on ms-be2012 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [00:52:05] RECOVERY - swift-object-server on ms-be2012 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:52:06] RECOVERY - swift-object-auditor on ms-be2012 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [01:00:26] !log sending farsi election emails via terbium [01:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:29] (03PS1) 1020after4: group1 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353202 [01:04:32] (03CR) 1020after4: [C: 032] group1 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353202 (owner: 1020after4) [01:05:24] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353202 (owner: 1020after4) [01:05:35] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [01:05:51] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.1 [01:05:55] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [01:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:11] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353202 (owner: 1020after4) 
[01:06:35] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [01:08:48] !log sending French election emails via terbium [01:08:55] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [01:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:14] Wikimedia\Rdbms\LoadBalancer::reuseConnection: got DBConnRef instance. [01:20:23] is that something to be worried about? [01:21:31] !log sending he, hi and id election emails via terbium [01:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:36] !log sending Italian and Japanese election emails via terbium [01:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:51:42] twentyafterfour: it means some caller is trying to mix the two basic methods of getting foreign connections (getConnection() with manual reuseConnection() or just getConnectionRef). DBConnRef already calls the "reuse" method on the Database it wraps; nothing should call reuseConnection() on the DBConnRef itself. It just no-ops, so it's not urgent. [01:55:03] !log sending polish and dutch election emails via terbium [01:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:38] !log sending pt,pt-br and ru election emails via terbium [02:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:35] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
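The reuseConnection() explanation above distinguishes two lifecycles in MediaWiki's Wikimedia\Rdbms\LoadBalancer: manual getConnection()/reuseConnection() pairs versus getConnectionRef(), whose DBConnRef wrapper releases the underlying Database itself. A toy ref-counting model of that behavior (plain Python, NOT MediaWiki source; class and method names are simplified analogues of the PHP API named in the log):

```python
# Toy model (an assumption-laden sketch, not MediaWiki code) of why calling
# reuseConnection() on a DBConnRef is a warned no-op.

class Database:
    def __init__(self):
        self.refcount = 0

class DBConnRef:
    """Wrapper that releases the real handle itself (in PHP, via __destruct)."""
    def __init__(self, lb, conn):
        self._lb, self._conn = lb, conn

    def release(self):
        self._lb.reuse_connection(self._conn)

class LoadBalancer:
    def __init__(self):
        self._conn = Database()
        self.warnings = []

    def get_connection(self):
        # Pattern 1: manual lifecycle; caller must call reuse_connection().
        self._conn.refcount += 1
        return self._conn

    def reuse_connection(self, conn):
        if isinstance(conn, DBConnRef):
            # Mixing the two patterns: warn and do nothing; the wrapper
            # will release the Database it wraps on its own.
            self.warnings.append("reuseConnection: got DBConnRef instance.")
            return
        conn.refcount -= 1

    def get_connection_ref(self):
        # Pattern 2: automatic lifecycle via the wrapper object.
        return DBConnRef(self, self.get_connection())

lb = LoadBalancer()

# Correct manual use: refcount returns to 0.
c = lb.get_connection()
lb.reuse_connection(c)

# Incorrect mixed use: passing the wrapper only produces the warning.
ref = lb.get_connection_ref()
lb.reuse_connection(ref)   # logs "reuseConnection: got DBConnRef instance."
ref.release()              # the wrapper releases the real handle
```

As the log says, the mixed call is harmless (a no-op plus a log line), which is why it was triaged as not urgent.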
[02:18:29] !log sending uk and vi election emails via terbium [02:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:37] !log sending Chinese election emails via terbium [02:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:39] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.21) (duration: 13m 22s) [02:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:40] !log Sending English and all other language election emails via terbium [02:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:25] !log all election emails out [02:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:10] 06Operations, 10MediaWiki-ResourceLoader, 10MediaWiki-extensions-CentralNotice, 06Performance-Team, and 2 others: Provide location, logged-in status and device information in ResourceLoaderContext - https://phabricator.wikimedia.org/T103695#3254003 (10Krinkle) 05Open>03declined Declining as I I don't t... [03:08:07] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 13m 33s) [03:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:14:51] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 11 03:14:51 UTC 2017 (duration 6m 44s) [03:14:55] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [03:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:35] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[03:17:55] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [04:07:35] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [04:37:35] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [04:37:55] PROBLEM - HP RAID on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [04:47:25] RECOVERY - HP RAID on ms-be1021 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller [05:13:25] 06Operations, 10netops: Zayo Circuit ulsfo<->codfw down - https://phabricator.wikimedia.org/T165006#3254023 (10ayounsi) [05:14:03] ACKNOWLEDGEMENT - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR Ayounsi https://phabricator.wikimedia.org/T165006 [05:32:22] (03PS2) 10Kaldari: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351195 (https://phabricator.wikimedia.org/T165007) (owner: 10Niharika29) [05:34:26] 06Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T164953#3254060 (10MoritzMuehlenhoff) 05Open>03Invalid Ok, marked as Invalid, since it's a duplicate, then. 
[05:47:48] 06Operations, 06DC-Ops, 10netops: Interface errors on asw-c-eqiad:xe-8/0/38 - https://phabricator.wikimedia.org/T165008#3254070 (10ayounsi) [05:48:15] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=633.50 Read Requests/Sec=724.60 Write Requests/Sec=0.20 KBytes Read/Sec=46438.00 KBytes_Written/Sec=7.20 [05:52:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353216 [05:52:35] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353216 [05:54:24] 06Operations, 10netops: JSNMP flood of errors across multiple switches - https://phabricator.wikimedia.org/T83898#3254097 (10ayounsi) Not sure yet if related, but LibreNMS doesn't poll all the interfaces from at least asw-c-eqiad. For example, xe-8/0/38 is missing. [05:54:40] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353216 (owner: 10Marostegui) [05:55:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353216 (owner: 10Marostegui) [05:55:51] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353216 (owner: 10Marostegui) [05:55:55] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[05:56:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T147166 T130067 (duration: 00m 57s) [05:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:07] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [05:57:07] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [05:57:15] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=14.70 Read Requests/Sec=1.20 Write Requests/Sec=0.60 KBytes Read/Sec=26.80 KBytes_Written/Sec=6.00 [06:05:12] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 10Scap (Scap3-MediaWiki-MVP), 15User-Joe: Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3254105 (10Joe) >>! In T163565#3214272, @mmodell wrote: > @joe: That all seems reasonable. I don't particularly want to dupl... [06:06:05] (03PS1) 10Marostegui: db-eqiad.php: Repool db1056 with less load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353218 [06:08:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1056 with less load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353218 (owner: 10Marostegui) [06:09:50] (03PS3) 10Giuseppe Lavagetto: role::deployment::mediawiki: include ::profile::conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/349498 (https://phabricator.wikimedia.org/T163565) (owner: 1020after4) [06:09:54] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1056 with less load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353218 (owner: 10Marostegui) [06:10:06] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1056 with less load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353218 (owner: 10Marostegui) [06:11:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1056 with less load (duration: 00m 43s) [06:11:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:30] (03CR) 10Giuseppe Lavagetto: [C: 032] role::deployment::mediawiki: include ::profile::conftool::client [puppet] - 10https://gerrit.wikimedia.org/r/349498 (https://phabricator.wikimedia.org/T163565) (owner: 1020after4) [06:15:55] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [06:17:15] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [06:17:35] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [06:19:36] <_joe_> uh [06:20:37] <_joe_> seems like a load surge [06:20:49] (03PS1) 10Marostegui: wikitech.pp: Remove mira [puppet] - 10https://gerrit.wikimedia.org/r/353224 (https://phabricator.wikimedia.org/T164968) [06:21:15] <_joe_> load average: 54.90 on ms-be1019 [06:25:18] 06Operations, 10Scap (Scap3-MediaWiki-MVP): Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629#3254119 (10Joe) [06:25:20] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 10Scap (Scap3-MediaWiki-MVP), 15User-Joe: Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3254117 (10Joe) 05Open>03Resolved a:03Joe [06:25:22] 06Operations, 06Performance-Team, 07HHVM, 10Scap (Scap3-MediaWiki-MVP), 03releng-201617-q4: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#3254120 (10Joe) [06:26:55] PROBLEM - nova-compute process on labvirt1013 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [06:27:05] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [06:27:55] RECOVERY - nova-compute process on labvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/nova-compute [06:30:48] !log Drop mira user on wikitech database - T164968 [06:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:57] T164968: remove mira wikitech grants - https://phabricator.wikimedia.org/T164968 [06:31:05] PROBLEM - swift-object-server on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:05] PROBLEM - salt-minion processes on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:15] PROBLEM - swift-container-server on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:15] PROBLEM - swift-account-replicator on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:15] PROBLEM - swift-account-auditor on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:25] PROBLEM - swift-container-updater on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:35] PROBLEM - swift-object-updater on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:35] PROBLEM - swift-account-server on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:35] PROBLEM - swift-object-auditor on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:35] PROBLEM - swift-account-reaper on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:45] PROBLEM - swift-object-replicator on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:31:45] PROBLEM - dhclient process on ms-be2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:31:55] RECOVERY - swift-object-server on ms-be2004 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [06:31:55] RECOVERY - salt-minion processes on ms-be2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:32:05] RECOVERY - swift-container-server on ms-be2004 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:32:05] RECOVERY - swift-account-replicator on ms-be2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:32:06] RECOVERY - swift-account-auditor on ms-be2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:32:15] RECOVERY - swift-container-updater on ms-be2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:32:25] RECOVERY - swift-object-updater on ms-be2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [06:32:25] RECOVERY - swift-account-server on ms-be2004 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:32:25] RECOVERY - swift-object-auditor on ms-be2004 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [06:32:25] RECOVERY - swift-account-reaper on ms-be2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:32:35] RECOVERY - swift-object-replicator on ms-be2004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [06:32:35] RECOVERY - dhclient process on ms-be2004 is OK: PROCS OK: 0 processes with command name dhclient [06:34:02] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3254131 (10Marostegui) [06:34:04] 06Operations, 10DBA, 13Patch-For-Review: remove mira wikitech grants - 
https://phabricator.wikimedia.org/T164968#3254128 (10Marostegui) 05Open>03Resolved a:03Marostegui I have dropped the mira user on silver (I have saved this info just in case we need to recreate it because something else has broken,... [06:37:16] 06Operations, 10ops-eqiad, 10DBA: Decommission db1024 - https://phabricator.wikimedia.org/T164702#3254133 (10Marostegui) Hello @Cmjohnson From the DBA side you can proceed whenever you like. We do not have to do anything else I believe. MySQL is down It has been added to spare role on site.pp Disabled and... [06:37:24] 06Operations, 10ops-eqiad, 10DBA: Decommission db1024 - https://phabricator.wikimedia.org/T164702#3254134 (10Marostegui) [06:37:36] 06Operations, 10ops-eqiad, 10DBA: Decommission db1024 - https://phabricator.wikimedia.org/T164702#3242621 (10Marostegui) [06:41:37] (03PS1) 10Tim Starling: For HHVM set LANG=C.UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/353228 (https://phabricator.wikimedia.org/T107128) [06:42:36] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: allow read-only mode [puppet] - 10https://gerrit.wikimedia.org/r/353231 (https://phabricator.wikimedia.org/T159687) [06:42:38] (03PS1) 10Giuseppe Lavagetto: etcd: invert replication [puppet] - 10https://gerrit.wikimedia.org/r/353232 [06:49:25] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:32] !log migrating mw1293 (image scaler) to HHVM 3.18 and Linux 4.9 [06:50:35] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:45] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:45] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:50:45] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:45] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:45] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:45] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:25] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:51:35] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [06:51:35] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [06:51:35] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:51:35] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:51:35] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:51:35] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:54:36] 06Operations, 10Deployment-Systems, 13Patch-For-Review, 10Scap (Scap3-MediaWiki-MVP), 15User-Joe: Install conftool on deployment masters - https://phabricator.wikimedia.org/T163565#3254140 (10mmodell) Thanks @joe! [07:07:35] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[07:13:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [07:16:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:17:05] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?var-site=All&var-cache_type=text&var-status_type=5&panelId=3&fullscreen&orgId=1 [07:17:24] two big spikes [07:19:15] (not super big but I'd say relevant enough :) [07:19:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:20:05] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:23:51] ema: I see a lot of cp3040 ints from webrequests, worth checking whenever you have time [07:26:15] the second spike of 503s seems to match https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&from=now-3h&to=now&var-server=cp3040&var-datasource=esams%20prometheus%2Fops [07:43:37] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3254215 (10MoritzMuehlenhoff) That's somewhat expected, at this point nutcracker is not enabled for automatic service startup: ``` jmm@mw1293:~$ sudo systemctl is-enabled nutcracker disabled ``` And the puppet... [07:45:13] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, just a couple of comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [07:47:35] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[07:49:41] (03PS1) 10Filippo Giunchedi: Stop prerendering thumbs at 2560/2880 pixels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353244 (https://phabricator.wikimedia.org/T162796) [07:52:08] (03PS1) 10Marostegui: db-eqiad.php: Increase load db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353245 [07:53:04] !log roll-restart ms-fe1* for linux 4.9 upgrade - T162029 [07:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:13] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [07:53:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase load db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353245 (owner: 10Marostegui) [07:54:55] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:54:56] 06Operations, 10netops: Zayo Circuit ulsfo<->codfw down - https://phabricator.wikimedia.org/T165006#3254242 (10ayounsi) > We are expecting to have a tech onsite in El Paso around 1:45 AM MST to swap out an optic. Will provide another update once optic has been replaced. 
[07:55:02] (03Merged) 10jenkins-bot: db-eqiad.php: Increase load db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353245 (owner: 10Marostegui) [07:55:22] (03CR) 10Giuseppe Lavagetto: [C: 031] Stop prerendering thumbs at 2560/2880 pixels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353244 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi) [07:55:47] (03CR) 10jenkins-bot: db-eqiad.php: Increase load db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353245 (owner: 10Marostegui) [07:56:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [07:57:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [07:58:22] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3254246 (10MoritzMuehlenhoff) Also, since HHVM in our current config unconditionally tries to connect to unix::/var/run/nutcracker/redis_$DC.sock, we could also add an "After=nutcracker.service" to HHVM's unit. 
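The unit-ordering suggestion in the nutcracker task above would translate to something like this systemd drop-in (a sketch only; the drop-in path and the exact unit names are assumptions, not taken from the log):

```ini
# Hypothetical drop-in: /etc/systemd/system/hhvm.service.d/10-nutcracker.conf
[Unit]
# Start nutcracker first so /var/run/nutcracker/redis_$DC.sock exists
# by the time HHVM tries to connect to it at boot.
After=nutcracker.service
Wants=nutcracker.service
```

After= only orders startup; Wants= additionally pulls nutcracker in when HHVM starts, without making HHVM fail hard if nutcracker does (Requires= would do that instead).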
[07:59:55] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [08:02:16] (03PS4) 10Ayounsi: Add new logstash LVS service [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) [08:11:45] (03CR) 10Ayounsi: "Replying to comments" (0321 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [08:12:38] (03CR) 10Ayounsi: "Replies to comments" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) (owner: 10Ayounsi) [08:18:45] (03CR) 10Gilles: [C: 031] Stop prerendering thumbs at 2560/2880 pixels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353244 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi) [08:19:08] (03PS1) 10Elukey: Re-enable persistent connection to Redis for jobrunners in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353247 (https://phabricator.wikimedia.org/T125735) [08:23:58] (03CR) 10Filippo Giunchedi: [C: 04-1] Add new logstash LVS service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [08:24:20] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM! 
Good job" [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [08:26:00] !log migrating mw1189 (API server) to HHVM 3.18 and Linux 4.9 [08:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1056 load (duration: 00m 42s) [08:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:36] (03PS1) 10Volans: LVS: move pybal config to a separate class [puppet] - 10https://gerrit.wikimedia.org/r/353250 (https://phabricator.wikimedia.org/T163196) [08:35:52] !log Run pt-table-checksum on s7.kowiki - https://phabricator.wikimedia.org/T163190 [08:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:10] (03PS5) 10Ayounsi: Add new logstash LVS service [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) [08:37:15] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
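The depool/repool changes being merged and synced throughout this log ("Depool db1067", "Repool db1056 with less load", "Increase db1056 load") adjust replica read weights in wmf-config/db-eqiad.php. A purely illustrative fragment of that file's shape (section name, hosts, and weights invented here, not the real contents):

```php
// Illustrative only: the shape of a db-eqiad.php section load map.
'sectionLoads' => [
    's1' => [
        'db1052' => 0,   // master: weight 0 keeps reads off it
        'db1056' => 50,  // "repool with less load": start at reduced weight
        'db1067' => 100, // fully repooled replica
    ],
],
// "Increase db1056 load" is then just raising the number and syncing the
// file out, as in the !log "Synchronized wmf-config/db-eqiad.php" entries.
```

Stepping the weight up gradually lets a cold replica warm its buffer pool before taking full read traffic.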
[08:38:40] (03PS1) 10Marostegui: db-eqiad.php: Increase db1056 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353252 [08:39:20] (03CR) 10Ayounsi: [C: 032] Add new logstash LVS service [puppet] - 10https://gerrit.wikimedia.org/r/353064 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [08:41:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1056 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353252 (owner: 10Marostegui) [08:43:30] (03PS2) 10Gehel: Logstash match_mapping_type still uses string, not text [puppet] - 10https://gerrit.wikimedia.org/r/353150 (https://phabricator.wikimedia.org/T164823) (owner: 10EBernhardson) [08:46:19] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1056 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353252 (owner: 10Marostegui) [08:46:28] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1056 load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353252 (owner: 10Marostegui) [08:46:34] (03CR) 10Volans: "Here my quick proposal to cleanup and fix the deployment-prep issue." [puppet] - 10https://gerrit.wikimedia.org/r/353250 (https://phabricator.wikimedia.org/T163196) (owner: 10Volans) [08:47:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1056 load (duration: 00m 43s) [08:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:41] (03CR) 10Hashar: [C: 031] "Yes lets do it!!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/353247 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [08:53:20] (03Abandoned) 10Alexandros Kosiaris: puppetmaster: /var/lib/puppet/ssl should be group puppet [puppet] - 10https://gerrit.wikimedia.org/r/248302 (owner: 10Alexandros Kosiaris) [08:55:02] !log migrating mw1161 (job runner) to HHVM 3.18 and Linux 4.9 [08:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:30] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353125 (owner: 10Dzahn) [08:55:36] (03PS2) 10Alexandros Kosiaris: ganeti: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353125 (owner: 10Dzahn) [08:55:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ganeti: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353125 (owner: 10Dzahn) [08:56:49] (03PS3) 10Alexandros Kosiaris: apertium: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352996 (owner: 10Dzahn) [08:56:55] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] apertium: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352996 (owner: 10Dzahn) [08:57:06] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [08:57:22] (03PS2) 10Alexandros Kosiaris: poolcounter: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353121 (owner: 10Dzahn) [08:57:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] poolcounter: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353121 (owner: 10Dzahn) [08:58:00] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3254415 (10MoritzMuehlenhoff) p:05Triage>03High [08:58:16] (03PS2) 10Alexandros Kosiaris: 
puppetmaster::backend: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353122 (owner: 10Dzahn) [08:58:22] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppetmaster::backend: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353122 (owner: 10Dzahn) [08:59:16] (03PS2) 10Alexandros Kosiaris: parsoid: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353118 (owner: 10Dzahn) [08:59:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] parsoid: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353118 (owner: 10Dzahn) [09:00:24] (03PS2) 10Alexandros Kosiaris: thumbor: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353120 (owner: 10Dzahn) [09:00:30] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] thumbor: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353120 (owner: 10Dzahn) [09:01:55] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [09:03:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [09:03:47] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR Ayounsi https://phabricator.wikimedia.org/T165006 [09:03:52] (03PS2) 10DCausse: [cirrus] Blacklist wikinews, wikiversity and multimedia from cross project search on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353043 (https://phabricator.wikimedia.org/T163463) [09:07:34] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[09:10:23] !log upgrading mw1170-mw1188 to HHVM 3.18 / Linux 4.9 (also pruning HHVM CLI bytecode since downtimed anyway) [09:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:18] dcausse gehel FYI https://gerrit.wikimedia.org/r/#/c/353064/ (logstash.svc.eqiad.wmnet) is being deployed, courtesy of XioNoX ! I forgot to add you to the code review heh [09:11:39] godog: thanks! I saw it fly by... [09:12:01] and thanks XioNoX for moving that forward! This is great news! [09:12:05] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: name=logstash1001.eqiad.wmnet [09:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:14] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89956.13 seconds [09:12:24] (03PS3) 10Gehel: Logstash match_mapping_type still uses string, not text [puppet] - 10https://gerrit.wikimedia.org/r/353150 (https://phabricator.wikimedia.org/T164823) (owner: 10EBernhardson) [09:12:29] godog:, XioNoX : thanks! [09:12:46] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: name=logstash1002.eqiad.wmnet [09:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:56] !log ayounsi@puppetmaster1001 conftool action : set/pooled=yes; selector: name=logstash1003.eqiad.wmnet [09:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:13] and now it is time to go through all those logging configurations and start using this new endpoint! 
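The `conftool action : set/pooled=yes` entries above are `confctl` invocations on the puppetmaster; a minimal sketch of the equivalent command line for the three logstash backends — the selector syntax mirrors the logged "selector: name=..." text, but treat the exact flags as an assumption and verify with `confctl --help` before running:

```
# Repool the three logstash LVS backends, one at a time.
# Selector syntax is taken from the logged conftool actions; flags are assumed.
for host in logstash1001 logstash1002 logstash1003; do
    confctl select "name=${host}.eqiad.wmnet" set/pooled=yes
done
```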
[09:14:17] probably need some testing first :) [09:14:36] XioNoX: I trust you :) [09:15:46] there's some weirdness with icinga ATM logstash.svc.eqiad.wmnet [09:15:53] TCP CRITICAL - Invalid hostname, address or socket: logstash.svc.codfw.wmnet [09:16:03] but I've seen this before, checking what's up [09:17:04] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [09:17:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [09:18:27] fyi, this is Zayo working on a circuit: https://phabricator.wikimedia.org/T165006 [09:20:04] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [09:23:40] godog: logstash.svc.codfw.wmnet does not resolve and I don't find it in the DNS repo [09:23:43] was it added? [09:24:04] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [09:24:43] volans: no, logstash isn't in codfw [09:25:09] my best guess ATM is that using %{::site} in the check_command in this case doesn't do the right thing, because tegmen is in codfw [09:25:20] eheheh [09:25:38] so yeah :( [09:25:46] XioNoX: ^ [09:25:51] I don't have the code at hand but sounds reasonable [09:26:25] so what would be the best fix for that? [09:26:54] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - logstash-syslog_10514_udp - Could not depool server logstash1001.eqiad.wmnet because of too many down! 
[09:27:16] very good question, the short term fix IMO is to change ::site with eqiad [09:27:21] (03CR) 10Gehel: [C: 032] Logstash match_mapping_type still uses string, not text [puppet] - 10https://gerrit.wikimedia.org/r/353150 (https://phabricator.wikimedia.org/T164823) (owner: 10EBernhardson) [09:28:21] long term I'm not sure, there's probably the same bug elsewhere where ${::site} is used [09:30:53] ocg seems eqiad only too and has the same $sip['ocg'][$::site] [09:30:56] ah and 10514/tcp isn't open by ferm, so that explains the healthcheck above [09:32:28] yeah, there are plenty of eqiad only checks, why are those not alerting? [09:32:34] volans: heh but not in check_command in hieradata/common/lvs/configuration.yaml [09:33:13] right! [09:33:53] godog: although kibana has it with $::site :D [09:34:01] yeah I was going to say [09:34:06] and other services as well [09:34:24] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [09:36:19] volans: yeah because the check there is interpreted differently, it is the Host: header [09:36:37] so basically depends on the check_command that's being used [09:37:39] and its options, right [09:39:08] (03PS1) 10Ayounsi: Workaround for puppet/icinga issue [puppet] - 10https://gerrit.wikimedia.org/r/353259 (https://phabricator.wikimedia.org/T151971) [09:39:29] godog, volans ^ [09:39:54] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [09:40:04] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [09:41:21] moritzm: ^ related to your upgrade ? cc elukey [09:41:25] XioNoX: thanks, I'll take a look [09:41:30] gneee checking [09:41:33] is cognate throwing all the errors?
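The check_command bug discussed above — `%{::site}` in hieradata/common/lvs/configuration.yaml is interpolated on the monitoring host (tegmen, in codfw) rather than the service's own site, so an eqiad-only service gets a check against a codfw name that does not exist — comes down to something like the following. The key names here are illustrative, not the exact production schema:

```yaml
# Hypothetical excerpt of hieradata/common/lvs/configuration.yaml; key names
# are illustrative, not the real schema.
logstash:
  # Before: %{::site} resolves where the check is generated (tegmen, codfw),
  # producing logstash.svc.codfw.wmnet, which does not resolve:
  # check_command: check_tcp_ip!logstash.svc.%{::site}.wmnet!10514
  #
  # Short-term workaround discussed above: hardcode eqiad for this
  # eqiad-only service. (akosiaris's CR later suggests check_tcp, which
  # does not have this problem.)
  check_command: check_tcp_ip!logstash.svc.eqiad.wmnet!10514
```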
[09:41:54] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [09:41:59] (https://logstash.wikimedia.org/app/kibana#/dashboard/memcached) [09:42:04] Wikimedia\Rdbms\LoadBalancer::reuseConnection: got DBConnRef instance. [09:42:04] godog: very unlikely, these are all depooled properly [09:42:38] top hosts [09:43:12] (paste nightmare, will type) [09:43:16] mw1161 [09:43:20] mw1174 [09:43:24] mw1171 [09:43:29] mw1173 [09:43:34] mw1170 [09:43:39] godog: for PYBAL CRITICAL - logstash-syslog_10514_udp - should we open the port in ferm, or remove the check? [09:43:40] but two big spikes [09:44:04] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [09:44:10] indeed, seems have gone back down [09:44:12] mw1170-mw1174 were in fact rebooted [09:44:17] XioNoX: the former I'd say [09:44:46] moritzm: zoomed only on last spike, related only to mw1170-mw1174 [09:45:10] Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [09:45:41] I guess until nutcracker came back up? [09:45:54] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [09:45:57] this is a good point, but only if the servers were not depooled [09:46:35] !log cp4010: downgrade varnish to 4.1.5-1wm4 and check frontend transient memory usage [09:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:57] they were all depooled (and in fact still are) [09:48:19] ah maybe health checks! 
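The ferm gap behind the PyBal `logstash-syslog_10514_udp` alert ("should we open the port in ferm... the former I'd say") became Gerrit change 353260. A sketch of what such a change typically looks like with the `ferm::service` define — the parameter values are assumptions, not the actual patch:

```puppet
# Hypothetical sketch of opening tcp/10514 (syslog input) on the logstash
# nodes; the srange value is an assumption -- the real change may scope
# the source networks differently.
ferm::service { 'logstash-syslog-tcp':
    proto  => 'tcp',
    port   => '10514',
    srange => '$DOMAIN_NETWORKS',
}
```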
[09:48:54] PROBLEM - puppet last run on tegmen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:49:07] (03PS1) 10Ayounsi: Ferm to allow tcp/10514 on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/353260 (https://phabricator.wikimedia.org/T151971) [09:49:39] godog, gehel ^ [09:49:43] godog, moritzm: on mw1174 I can see May 11 09:37:46 mw1174 systemd[1]: Started Nutcracker memcached/redis proxy, and from logstash the errors stopped right here [09:49:54] (for that host) [09:49:56] could be yeah, that and T163795 [09:49:56] T163795: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795 [09:51:02] ok all good, health checks not happy and nutcracker being lay [09:51:04] *lazy [09:51:21] moritzm: all good! [09:51:25] (03CR) 10Filippo Giunchedi: [C: 031] Ferm to allow tcp/10514 on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/353260 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [09:51:27] (03PS1) 10Muehlenhoff: Remove mira from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/353261 (https://phabricator.wikimedia.org/T164588) [09:52:42] elukey: not really good, though :-) that makes the HHVM upgrades needlessly noisy if only five reboots trigger a memcache alert [09:52:57] I've added some proposed fixes to T163795 earlier on, comments welcome [09:53:45] (03CR) 10Muehlenhoff: [C: 031] Ferm to allow tcp/10514 on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/353260 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [09:55:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Consider using check_tcp instead of check_tcp_ip which does not have that problem" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353259 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [09:55:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 
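The root cause pinned down above — HHVM health checks failing after the reboots because nutcracker was not yet up, tracked as T163795 ("Nutcracker doesn't start at boot") — is the sort of thing usually addressed with a systemd drop-in. A hedged sketch under assumed unit names, not the fix actually proposed on the task:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/nutcracker.service.d/boot.conf
# Unit names are assumptions; see T163795 for the fixes actually proposed.
[Unit]
# Bring the memcached/redis proxy up before HHVM starts serving.
Before=hhvm.service

[Service]
# Restart the proxy if it dies instead of leaving HHVM with a dead socket.
Restart=on-failure
RestartSec=2
```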
[09:56:07] godog: ack [09:56:08] (03PS1) 10Marostegui: db-eqiad.php: Restore db1056 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353262 [09:56:20] indeed what akosiaris said makes more sense [09:59:13] ebernhardson / gehel also thanks for fixing T164823 ! [09:59:14] T164823: Empty kibana dashboards after logstash upgrade - https://phabricator.wikimedia.org/T164823 [09:59:33] elukey: ^ [09:59:34] godog: as always, all credit goes to erik! [09:59:47] (03CR) 10Alexandros Kosiaris: [C: 031] Remove mira from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/353261 (https://phabricator.wikimedia.org/T164588) (owner: 10Muehlenhoff) [10:01:26] (03CR) 10Alexandros Kosiaris: [C: 031] Ferm to allow tcp/10514 on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/353260 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [10:02:04] 06Operations, 10ops-eqiad: mw1172 failed to reboot - https://phabricator.wikimedia.org/T165023#3254529 (10MoritzMuehlenhoff) [10:02:53] ACKNOWLEDGEMENT - Host mw1172 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T165023 [10:03:32] 06Operations, 05Goal, 15User-Joe, 07kubernetes: Upgrade calico to 2.1, document build process. 
- https://phabricator.wikimedia.org/T165024#3254545 (10Joe) [10:04:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1056 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353262 (owner: 10Marostegui) [10:06:14] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1056 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353262 (owner: 10Marostegui) [10:06:22] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1056 original load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353262 (owner: 10Marostegui) [10:06:34] (03CR) 10Ayounsi: [C: 032] Ferm to allow tcp/10514 on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/353260 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [10:07:01] (03PS2) 10Muehlenhoff: Remove mira from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/353261 (https://phabricator.wikimedia.org/T164588) [10:07:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1056 original load (duration: 00m 49s) [10:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:56] back shortly, reboot and run new power cable [10:09:54] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [10:10:14] (03CR) 10Muehlenhoff: [C: 032] Remove mira from tcpircbot config [puppet] - 10https://gerrit.wikimedia.org/r/353261 (https://phabricator.wikimedia.org/T164588) (owner: 10Muehlenhoff) [10:10:29] Run pt-table-checksum on s7.rowiki - https://phabricator.wikimedia.org/T163190 [10:10:35] !log Run pt-table-checksum on s7.rowiki - https://phabricator.wikimedia.org/T163190 [10:10:40] (03PS2) 10Ayounsi: Workaround for puppet/icinga issue [puppet] - 10https://gerrit.wikimedia.org/r/353259 (https://phabricator.wikimedia.org/T151971) [10:10:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:47] XioNoX: can I puppet-merge your change along? [10:11:09] moritzm: yep! I was about to do it, thanks [10:11:14] ok :-) [10:11:55] (03CR) 10Ayounsi: Workaround for puppet/icinga issue (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/353259 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [10:13:24] RECOVERY - Check whether ferm is active by checking the default input chain on tegmen is OK: OK ferm input default policy is set [10:13:24] RECOVERY - Check systemd state on tegmen is OK: OK - running: The system is fully operational [10:13:54] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [10:14:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [10:15:14] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [10:15:52] !log reboot ganeti200{5,6,7,8} for network reconfiguration [10:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:04] PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:24] PROBLEM - Host ganeti2008 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:25] PROBLEM - Host ganeti2005 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:34] PROBLEM - Host ganeti2007 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:40] (03CR) 10Alexandros Kosiaris: [C: 031] profile::etcd::tlsproxy: allow read-only mode [puppet] - 10https://gerrit.wikimedia.org/r/353231 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [10:19:44] PROBLEM - configured eth on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:20:34] RECOVERY - configured eth on ms-be1021 is OK: OK - interfaces up [10:20:56] (03CR) 10Alexandros Kosiaris: [C: 031] Workaround for puppet/icinga issue [puppet] - 10https://gerrit.wikimedia.org/r/353259 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [10:23:37] (03CR) 10Ayounsi: [C: 032] Workaround for puppet/icinga issue [puppet] - 10https://gerrit.wikimedia.org/r/353259 (https://phabricator.wikimedia.org/T151971) (owner: 10Ayounsi) [10:23:44] (03PS3) 10Ayounsi: Workaround for puppet/icinga issue [puppet] - 10https://gerrit.wikimedia.org/r/353259 (https://phabricator.wikimedia.org/T151971) [10:25:54] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [10:29:22] (03PS1) 10Muehlenhoff: Remove mira from role::mariadb::wikitech ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/353264 (https://phabricator.wikimedia.org/T164588) [10:30:24] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [10:35:07] 06Operations, 10netops: Zayo Circuit ulsfo<->codfw down - https://phabricator.wikimedia.org/T165006#3254588 (10ayounsi) 05Open>03Resolved >Our equipment vendor performed cold restart on a card in El Paso TX which has restored your service. I am currently seeing two traffic passing on the circuit. If you ar... [10:37:10] 06Operations, 10fundraising-tech-ops, 10netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3254591 (10ayounsi) Nothing explicit in the logs. 
I've opened case 2017-0511-0002 with JTAC [10:37:34] (03PS2) 10Muehlenhoff: Remove mira from role::mariadb::wikitech ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/353264 (https://phabricator.wikimedia.org/T164588) [10:40:07] (03CR) 10Muehlenhoff: [C: 032] Remove mira from role::mariadb::wikitech ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/353264 (https://phabricator.wikimedia.org/T164588) (owner: 10Muehlenhoff) [10:43:35] RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms [10:43:45] PROBLEM - ganeti-noded running on ganeti2006 is CRITICAL: Return code of 255 is out of bounds [10:44:15] PROBLEM - SSH on ganeti2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:44:15] PROBLEM - puppet last run on ganeti2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:44:15] PROBLEM - salt-minion processes on ganeti2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:45:45] PROBLEM - Check the NTP synchronisation status of timesyncd on ganeti2006 is CRITICAL: Return code of 255 is out of bounds [10:45:45] RECOVERY - Host ganeti2007 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [10:45:55] RECOVERY - Host ganeti2008 is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [10:46:05] RECOVERY - SSH on ganeti2006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [10:46:05] RECOVERY - Host ganeti2005 is UP: PING OK - Packet loss = 0%, RTA = 1.91 ms [10:46:05] RECOVERY - salt-minion processes on ganeti2006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:46:29] akosiaris: 30m, so quick to reboot?
:-P [10:46:45] RECOVERY - ganeti-noded running on ganeti2006 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded [10:46:51] volans: no, I actually stumbled across a limitation in /e/n/i [10:47:01] an interface can not have dashes in its name [10:47:05] RECOVERY - puppet last run on ganeti2006 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [10:47:16] well in /e/n/i .. probably the kernel [10:47:33] anyway.. fixed that and now ganeti complains that br0 is a missing bridge [10:48:03] which I am not sure why .. I was under the impression this check is per nodegroup [10:48:15] and indeed no node in that group needs or has br0 [10:48:28] but nodes in the other group do.. anyway will look into it more after lunch [10:49:29] ok :) [10:58:39] (03PS1) 10Muehlenhoff: role::mariadb::wikitech: Switch to ferm constants [puppet] - 10https://gerrit.wikimedia.org/r/353266 [10:58:44] (03PS1) 10Marostegui: db-codfw.php: Depool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353267 (https://phabricator.wikimedia.org/T162611) [11:00:40] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353267 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [11:01:55] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353267 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [11:02:07] (03CR) 10jenkins-bot: db-codfw.php: Depool db2064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353267 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [11:03:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2064 - T162611 (duration: 00m 42s) [11:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:19] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [11:03:26] !log Deploy alter table on s2
(revision table) db2064 - T162611 [11:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:36] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [11:04:35] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [11:08:17] 06Operations, 05Goal, 15User-Joe, 07kubernetes: Upgrade calico to 2.1, document build process. - https://phabricator.wikimedia.org/T165024#3254648 (10Joe) [11:10:50] (03CR) 10Marostegui: [C: 031] "You have actually also fixed: https://gerrit.wikimedia.org/r/#/c/353224/ :-)" [puppet] - 10https://gerrit.wikimedia.org/r/353266 (owner: 10Muehlenhoff) [11:11:31] (03Abandoned) 10Marostegui: wikitech.pp: Remove mira [puppet] - 10https://gerrit.wikimedia.org/r/353224 (https://phabricator.wikimedia.org/T164968) (owner: 10Marostegui) [11:15:45] RECOVERY - Check the NTP synchronisation status of timesyncd on ganeti2006 is OK: OK: synced at Thu 2017-05-11 11:15:36 UTC. [11:18:44] 06Operations, 05Goal, 15User-Joe, 07kubernetes: Upgrade calico to 2.1, document build process. - https://phabricator.wikimedia.org/T165024#3254693 (10Joe) I am re-doing our calico-containers repository from scratch, importing a version from upstream and managing the now-minimal changes to the Dockerfiles w... [11:45:55] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[11:47:34] (03CR) 10Muehlenhoff: [C: 032] role::mariadb::wikitech: Switch to ferm constants [puppet] - 10https://gerrit.wikimedia.org/r/353266 (owner: 10Muehlenhoff) [11:51:40] (03PS1) 10Ema: varnish: limit varnishd transient storage [puppet] - 10https://gerrit.wikimedia.org/r/353274 (https://phabricator.wikimedia.org/T164768) [11:57:33] (03PS1) 10Giuseppe Lavagetto: docker::baseimages: separate build script for alpine linux [puppet] - 10https://gerrit.wikimedia.org/r/353275 (https://phabricator.wikimedia.org/T165024) [12:00:47] (03CR) 10Faidon Liambotis: [C: 032] "This needs further cleanups but looks OK as an incremental." [puppet] - 10https://gerrit.wikimedia.org/r/353250 (https://phabricator.wikimedia.org/T163196) (owner: 10Volans) [12:05:08] (03PS5) 10Faidon Liambotis: labs: remove the _eth0 suffix from ipaddress facts [puppet] - 10https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196) [12:05:12] (03PS6) 10Faidon Liambotis: Switch add_ip6_mapped to use interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196) [12:05:13] (03PS4) 10Faidon Liambotis: Remove c/p interface argument to add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/350768 (https://phabricator.wikimedia.org/T163196) [12:05:15] (03PS2) 10Faidon Liambotis: Move all add_ip6_mapped calls to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/353095 [12:05:17] (03PS4) 10Faidon Liambotis: lvs: remove support for <= trusty [puppet] - 10https://gerrit.wikimedia.org/r/350769 [12:05:19] (03PS6) 10Faidon Liambotis: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) [12:05:21] (03PS6) 10Faidon Liambotis: cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196) [12:05:29] (03PS2) 10Volans: LVS: move pybal config to a separate class 
[puppet] - 10https://gerrit.wikimedia.org/r/353250 (https://phabricator.wikimedia.org/T163196) [12:05:55] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [12:11:29] (03PS2) 10Ema: varnish: limit varnishd transient storage [puppet] - 10https://gerrit.wikimedia.org/r/353274 (https://phabricator.wikimedia.org/T164768) [12:11:38] (03CR) 10KartikMistry: "> Just as a reminder, updateCategoryCollation.php must be run on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [12:12:05] PROBLEM - Host ganeti2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:05] PROBLEM - Host ganeti2008 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:15] PROBLEM - Host ganeti2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:30] me again ^ [12:12:31] ignore [12:13:03] akosiaris: you just made me lose a couple of years of life... I just merged and ran a noop patch on LVSes ;) [12:13:05] RECOVERY - Host ganeti2008 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [12:13:05] RECOVERY - Host ganeti2007 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [12:13:20] hahahaha [12:13:25] RECOVERY - Host ganeti2005 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [12:13:39] (03CR) 10Amire80: "> > Just as a reminder, updateCategoryCollation.php must be run on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [12:14:52] (03CR) 10KartikMistry: "> > > Just as a reminder, updateCategoryCollation.php must be run on" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [12:17:26] (03CR) 10Amire80: "> > > > Just as a reminder, updateCategoryCollation.php must be run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353099 (https://phabricator.wikimedia.org/T162823) (owner: 10Amire80) [12:19:25] (03PS3) 10Ema: varnish: limit varnishd
transient storage size [puppet] - 10https://gerrit.wikimedia.org/r/353274 (https://phabricator.wikimedia.org/T164768) [12:19:46] !log reboot kafka100[23] for kernel upgrades (kafka main-eqiad, eventbus eqiad) [12:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:35] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [12:33:04] (03CR) 10Ema: [C: 031] Drop cache/LVS NFS override [puppet] - 10https://gerrit.wikimedia.org/r/352748 (https://phabricator.wikimedia.org/T106477) (owner: 10Muehlenhoff) [12:33:58] !log Run pt-table-checksum on s7.ukwiki - https://phabricator.wikimedia.org/T163190 [12:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:11] (03PS6) 10Volans: labs: remove the _eth0 suffix from ipaddress facts [puppet] - 10https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [12:46:03] I'm earlier but I'm around for EU swat fyi [12:46:06] early* [12:47:01] (03CR) 10Volans: [C: 032] labs: remove the _eth0 suffix from ipaddress facts [puppet] - 10https://gerrit.wikimedia.org/r/350767 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [12:49:33] (03PS1) 10Filippo Giunchedi: logstash: build http_request from webrequest fields [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) [12:49:38] (03PS2) 10Gehel: elasticsearch - silence some loggers for elastic 5.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/353105 [12:50:53] jouncebot: refresh [12:50:56] I refreshed my knowledge about deployments. 
[12:50:57] jouncebot: next [12:50:57] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T1300) [12:51:27] (03PS3) 10Gehel: elasticsearch - silence some loggers for elastic 5.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/353105 [12:51:32] I'm around too [12:51:44] how are you today godog? [12:51:55] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [12:52:21] not bad Zppix ! working away [12:52:39] godog: me too, hope your patch doesn't cause a nuclear meltdown :) [12:53:27] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3254966 (10fgiunchedi) >>! In T163795#3254246, @MoritzMuehlenhoff wrote: > Also, since HHVM in our current config unconditionally tries to connect to unix::/var/run/nutcracker/redis_$DC.sock would could also add... [12:53:44] Zppix: I'm fairly sure it isn't going to, if anything it'll help [12:53:55] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [12:53:59] godog: Haha, I was just joking. Have a great day [12:54:14] hehe ok! you too [12:55:36] !log migrate sca2004 to ganeti nodegroup row_A [12:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:22] (03CR) 10Addshore: wgRevisionSliderAlternateSlider true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350848 (owner: 10Addshore) [12:58:26] (03PS2) 10Addshore: wgRevisionSliderAlternateSlider true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350848 [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T1300).
[13:00:04] Zppix and godog: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:17] o. [13:00:18] o/ [13:01:19] * aude waves [13:01:27] * addshore has just added 2 patches [13:01:29] i'll have something to add that maybe i can do myself after [13:01:58] who's doing swat? [13:03:26] I can swat [13:03:56] addshore: godog or yourself can go first if you want... my patch isn't that major [13:04:17] Zppix: as you're here I'll do yours first! [13:04:29] godog is here, he poked his head in earlier fyi [13:04:31] but ok [13:04:47] addshore: any chance that we could also add https://gerrit.wikimedia.org/r/#/c/353247/ in the mix? [13:05:02] elukey: sure [13:05:11] thanks a lot! [13:05:15] (03PS3) 10Addshore: Correct alias(es) from es.wikisource to eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353059 (https://phabricator.wikimedia.org/T164888) (owner: 10Zppix) [13:05:21] (03CR) 10Addshore: [C: 032] Correct alias(es) from es.wikisource to eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353059 (https://phabricator.wikimedia.org/T164888) (owner: 10Zppix) [13:05:33] (03PS2) 10Addshore: Stop prerendering thumbs at 2560/2880 pixels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353244 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi) [13:05:46] (03PS3) 10Addshore: wgRevisionSliderAlternateSlider true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350848 [13:05:58] (03PS2) 10Addshore: Re-enable persistent connection to Redis for jobrunners in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353247 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [13:06:17] Zppix: testable on mwdebug?
[13:06:36] addshore: i can attempt to [13:06:46] addshore: wait one [13:06:51] (03CR) 10DCausse: [C: 04-1] elasticsearch - silence some loggers for elastic 5.3 upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353105 (owner: 10Gehel) [13:06:57] (03Merged) 10jenkins-bot: Correct alias(es) from es.wikisource to eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353059 (https://phabricator.wikimedia.org/T164888) (owner: 10Zppix) [13:07:02] Zppix: it isnt there quite yet ;) [13:07:06] (03CR) 10jenkins-bot: Correct alias(es) from es.wikisource to eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353059 (https://phabricator.wikimedia.org/T164888) (owner: 10Zppix) [13:07:09] oh okay :P [13:07:13] ping me when it is [13:07:30] Zppix: its there now, mwdebug1002 [13:07:40] alrighty give me a min [13:07:44] (03PS4) 10Gehel: elasticsearch - silence some loggers for elastic 5.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/353105 [13:08:00] (03CR) 10Gehel: elasticsearch - silence some loggers for elastic 5.3 upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353105 (owner: 10Gehel) [13:08:38] addshore: aliases work as expected any errors on mwdebug1002? [13:08:46] elukey: can you add your patch to the deployments page please? :) [13:08:51] Zppix: ack, ill sync now! [13:08:57] addshore: sure! [13:09:21] please run namespacedupes on eo wikisource and es wikisource addshore [13:10:01] Zppix: okay [13:10:03] (03CR) 10DCausse: [C: 031] elasticsearch - silence some loggers for elastic 5.3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/353105 (owner: 10Gehel) [13:10:10] aude: what is the current server to run maintenance scripts on? 
[13:10:14] addshore: thanks for the swat :) [13:10:28] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T164888 [[gerrit:353059|Correct alias(es) from es.wikisource to eo.wikisource]] (duration: 00m 42s) [13:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:37] T164888: Correct alias from es.wikisource to eo.wikisource - https://phabricator.wikimedia.org/T164888 [13:11:43] addshore: not sure [13:11:51] if we are back on tin, then maybe terbium ? [13:11:58] looks like it is :) [13:12:00] or they can be run on tin [13:12:08] if it's something simple [13:12:13] ack [13:12:47] addshore: it's terbium again since the DC switch back [13:13:28] addshore: can you let me know when namespacedupes is finished so I can resolve the task? [13:15:01] I've not run namespacedupes before [13:15:21] What is this add-prefix ;) [13:16:47] addshore: hold on, I'll find the documentation for you [13:16:57] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes doesn't mention it [13:17:08] Dupes will be renamed into the correct namespace, with the prefix prepended before the article name [13:17:21] There are only 2 dupes, I assume that is what we / you want though? [13:18:22] correct [13:18:30] the question is, what text? [13:19:09] addshore: well the issue was es wikisource was given incorrect aliases that were actually originally meant for eo wikisource [13:19:16] so maybe the text would be the aliases? [13:19:38] I'm not quite sure, all I know is that datguy suggested that the maintenance script be run when this was swatted [13:19:45] correction chad [13:19:47] not datguy [13:20:45] addshore: Chad May 10 10:08 AM Patch Set 2: Needs namespaceDupes (at least dry run) being done on the two wikis, in case there's any conflicts.
[13:21:19] !log addshore@tin Synchronized php-1.30.0-wmf.1/extensions/Cognate/src/CognateStore.php: SWAT: T165005 [[gerrit:353249|Dont pass ConnectionRefs to ConnectionManager::releaseConnection]] (duration: 00m 42s) [13:21:20] From the SAL 13:10 hashar: zh_classicalwiki : renamed broken page via namespaceDupes.php : id=73504 ns=0 dbk=模板:Protected_logo -> 模板:Protected_logobroken [13:21:26] maybe I should just add broken? :) [13:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:27] T165005: Wikimedia\Rdbms\LoadBalancer::reuseConnection: got DBConnRef instance. - https://phabricator.wikimedia.org/T165005 [13:21:29] there are only 2 pages [13:21:55] addshore: on the deployers page it doesnt show maybe just do the project name and then add --fix to the command? [13:22:17] godog: are you ready for your patch? [13:22:17] addshore: which pages ill just add the info to the task and maybe they can fix it on the project manually [13:22:28] id=2268 ns=0 dbk=Auxtoro:Hendrik_Conscience *** dest title exists and --add-prefix not specified [13:22:29] id=2276 ns=0 dbk=Auxtoro:Lawrence_Lessig *** dest title exists and --add-prefix not specified [13:22:38] on eowikisource [13:22:42] addshore: yep [13:22:47] Yeah, fix fixes if it can [13:22:52] Reedy: thanks! 
[13:22:56] Otherwise it needs rules to follow [13:22:59] (03CR) 10Addshore: [C: 032] Stop prerendering thumbs at 2560/2880 pixels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353244 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi) [13:23:03] So suffix/prefix [13:23:05] !log rebooting restbase2005 for update to Linux 4.9 / new openjdk [13:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:46] Reedy: thanks, one of these days maybe scripts won't take a whole irc channel to decode :P [13:24:00] I only just came in [13:24:09] reedy knows aoo ;) [13:24:09] I didn't even read backscroll [13:24:11] *all [13:24:26] Reedy: fancy adding a little note to the swat deploy notes about what to do if --fix doesn't work? ;) [13:24:34] addshore: my motto is if i can type commands and nothing breaks then something's working :P [13:24:37] Sounds like effort? [13:24:39] (03Merged) 10jenkins-bot: Stop prerendering thumbs at 2560/2880 pixels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353244 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi) [13:24:45] Where's it documented? [13:24:52] godog: is this testable on mwdebug? or? [13:24:53] Reedy: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#namespaceDupes [13:25:19] Unfortunately it's not so simple [13:25:25] If it's RTL.. [13:25:30] Or non Latin alphabet [13:25:45] You risk making shitty hard titles to manually move [13:25:47] addshore: I'm not 100% sure, I can test an upload there [13:25:50] (03CR) 10jenkins-bot: Stop prerendering thumbs at 2560/2880 pixels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353244 (https://phabricator.wikimedia.org/T162796) (owner: 10Filippo Giunchedi) [13:26:02] godog: okay! it is on mwdebug1002 now! [13:26:10] I imagine it's not [13:26:12] addshore: thanks again, I'm going to resolve the task on phab now unless you have any further problems you have to report to me.
[13:26:29] Zppix: nope, as long as reedy works his magic to fix those 2 ;) [13:26:39] Depends if anyone cares [13:26:42] Usually they don't [13:26:44] addshore: ack :) have a great day.... [13:26:57] Various wikis have stranded pages for years [13:27:09] addshore: ok I'll test an upload [13:27:29] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [13:27:30] godog: will you be at the hackathon? I have to grab you to help me fix https://gerrit.wikimedia.org/r/#/c/322220/ at some point ;) [13:28:46] (03CR) 10Addshore: [C: 032] wgRevisionSliderAlternateSlider true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350848 (owner: 10Addshore) [13:29:23] addshore: yes I will! we can take a look for sure :)) [13:30:10] addshore: anyways my WMF account isn't confirmed so no upload, I can double check the patch on swift once it is deployed everywhere though [13:30:12] (03Merged) 10jenkins-bot: wgRevisionSliderAlternateSlider true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350848 (owner: 10Addshore) [13:30:23] godog: okay! syncing! [13:30:24] (03CR) 10jenkins-bot: wgRevisionSliderAlternateSlider true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350848 (owner: 10Addshore) [13:30:45] addshore: thanks! [13:31:00] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T162796 [[gerrit:353244|Stop prerendering thumbs at 2560/2880 pixels]] (duration: 00m 41s) [13:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:10] T162796: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796 [13:32:44] addshore: checking [13:32:47] godog: ack! 
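To make the namespaceDupes exchange above concrete: with --fix the script moves a stranded page into the namespace its title now parses as, and when the destination title already exists it needs a suffix/prefix rule such as --add-prefix, otherwise it emits the "dest title exists and --add-prefix not specified" lines quoted above. A rough model of that decision follows; the namespace id 102 is made up, and this is an illustration, not the real MediaWiki maintenance script:

```python
NS_AUXTORO = 102  # hypothetical namespace id for the new eowikisource alias

def resolve_dupes(stranded, existing, add_prefix=None):
    """Sketch of namespaceDupes --fix conflict handling (illustrative only).

    stranded: [(0, "Auxtoro:Hendrik_Conscience"), ...] -- pages left in the
              main namespace whose title text now parses as a real namespace.
    existing: set of (ns, dbkey) pages already present at the destination.
    Returns (renames, conflicts).
    """
    renames, conflicts = [], []
    for ns, dbkey in stranded:
        _ns_text, _, rest = dbkey.partition(":")
        dest = (NS_AUXTORO, rest)  # where the page ought to live
        if dest not in existing:
            renames.append(((ns, dbkey), dest))
        elif add_prefix:
            # a prefix rule parks the stranded page under a non-colliding name
            renames.append(((ns, dbkey), (NS_AUXTORO, add_prefix + rest)))
        else:
            conflicts.append(
                f"ns={ns} dbk={dbkey} *** dest title exists and --add-prefix not specified"
            )
    return renames, conflicts
```

Run on the two eowikisource pages above, the sketch reports both as conflicts until a prefix rule is supplied, mirroring the output addshore pasted.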
[13:33:04] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: (notask) [[gerrit:350848|wgRevisionSliderAlternateSlider true everywhere]] PT 1/2 (duration: 00m 43s) [13:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:20] (03CR) 10Addshore: [C: 032] Re-enable persistent connection to Redis for jobrunners in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353247 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [13:33:23] elukey: your up next [13:33:49] addshore: LGTM, thanks! [13:33:53] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: (notask) [[gerrit:350848|wgRevisionSliderAlternateSlider true everywhere]] PT 2/2 (duration: 00m 42s) [13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:06] good! [13:34:12] (03PS3) 10Addshore: Re-enable persistent connection to Redis for jobrunners in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353247 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [13:34:16] (03CR) 10Addshore: [C: 032] Re-enable persistent connection to Redis for jobrunners in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353247 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [13:34:24] elukey: I made a tiny edit (added a comma) [13:34:30] ah thanks! [13:34:50] going to check on deployment-jobrunner02.deployment-prep.eqiad.wmflabs [13:34:55] not that it's needed, but it avoids it perhaps being missed in the future if anyone adds anything to the array [13:35:02] * elukey nods [13:35:31] now that I think about it, how is deployment-prep synced with the main repo? [13:35:43] I just need to go to deployment-tin and redo the deployment? 
[13:35:48] once it is merged the code will be deployed to labs [13:35:48] (03Merged) 10jenkins-bot: Re-enable persistent connection to Redis for jobrunners in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353247 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [13:35:57] (03CR) 10jenkins-bot: Re-enable persistent connection to Redis for jobrunners in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353247 (https://phabricator.wikimedia.org/T125735) (owner: 10Elukey) [13:36:00] ^^ [13:36:07] that last CR comment was the labs update i believe [13:36:19] awesome [13:36:21] beta-mediawiki-config-update-eqiad SUCCESS Change has been deployed on the EQIAD beta cluster in 1s [13:37:30] mmm /srv/mediawiki/wmf-config/jobqueue-labs.php has not been updated on deployment-jobrunner02.deployment-prep.eqiad.wmflabs [13:37:45] *looks* [13:37:52] (03CR) 10Alexandros Kosiaris: [C: 032] Renumber sca2004 in private1-a-codfw [dns] - 10https://gerrit.wikimedia.org/r/351304 (owner: 10Alexandros Kosiaris) [13:37:56] (03PS2) 10Alexandros Kosiaris: Renumber sca2004 in private1-a-codfw [dns] - 10https://gerrit.wikimedia.org/r/351304 [13:37:58] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Renumber sca2004 in private1-a-codfw [dns] - 10https://gerrit.wikimedia.org/r/351304 (owner: 10Alexandros Kosiaris) [13:38:09] PROBLEM - HHVM jobrunner on mw1161 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.072 second response time [13:38:10] elukey: (cc addshore )) any changes merged in mediawiki-config will automatically be uploaded to the "labs" version of the code [13:38:21] (03CR) 10Volans: [C: 031] "LGTM, see compiler runs below." 
[puppet] - 10https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [13:38:26] elukey: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/7510/console looks like it will have landed on tin [13:38:32] I think the scap is seperate [13:38:47] you can see everything @ https://integration.wikimedia.org/ci/view/Beta/ [13:39:09] RECOVERY - HHVM jobrunner on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.075 second response time [13:39:13] you should also be able to trigger a scap from there [13:39:25] moritzm: are you working on mw1161? [13:39:38] addshore: I believe beta code configs are automated [13:39:44] addshore: let me know when you are done [13:39:45] changes ^ [13:39:47] no, seems have crashed, just logged a backtrace [13:39:48] aude: will do [13:39:49] addshore: sure! will scap sync-file from deployment-tin [13:39:50] k [13:40:24] Zppix: i beleive to update on the tin machine on deployment prep is automated [13:40:35] Ja [13:40:44] Few times an hour [13:41:15] ahh yes, it should be scaped, as that job then triggers a beta-scap-eqiad [13:41:47] elukey: thats odd that the code change didn't appear [13:42:20] elukey: at least it recovered/restarted, or did you restart it? [13:43:37] (03CR) 10Volans: [C: 031] "To be more explicit, I've verified that for each diff the only things that changes is the parameter of the interface::add_ip6_mapped, give" [puppet] - 10https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [13:44:15] May 11 13:37:00 mw1161 hhvm: Core dumped: Segmentation fault [13:44:16] May 11 13:37:00 mw1161 hhvm: Stack trace in /var/log/hhvm/stacktrace.22900.log [13:44:22] moritzm: --^ [13:44:29] nope didn't do anything [13:45:37] addshore: now it is on the host! [13:45:45] ack! [13:46:21] elukey: this is labs only right? 
[13:46:35] addshore: yep yep, I am trying to test it before going to prod [13:47:04] elukey: yeah, seems stable for appservers, but apparently the crash on the job runner is new, will open a task [13:47:19] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (Zotero alive) timed out before a response was received [13:48:15] !log addshore@tin Synchronized wmf-config/jobqueue-labs.php: SWAT: LABS ONLY [[gerrit:353247|Re-enable persistent connection to Redis for jobrunners in lab]] (duration: 00m 41s) [13:48:19] aude: its all yours [13:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:19] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [13:49:45] addshore: thanks [13:50:19] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [13:50:43] time to drive to prague :D [13:51:36] cool :) [13:51:50] * aude is driving to montreal (probably) [13:52:26] (03PS2) 10Filippo Giunchedi: logstash: build http_request from webrequest fields [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) [13:53:20] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [13:54:29] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (Zotero alive) timed out before a response was received [13:54:53] testing my thing on mwdebug [13:55:19] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [13:55:20] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [13:55:29] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Zotero alive) timed out before a response was received [13:56:19] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [13:56:31] !log mobrovac@tin 
Started restart [zotero/translation-server@6a4a828]: (no justification provided) [13:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:19] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [13:59:19] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [13:59:30] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [13:59:50] !log aude@tin Synchronized php-1.30.0-wmf.1/extensions/Wikidata: Update quality constraints (duration: 02m 14s) [13:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:07] done [14:00:29] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:01:29] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:01:49] !log mobrovac@tin Started restart [zotero/translation-server@50f216a]: Zotero unresponsive [14:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:35] (03PS1) 10Giuseppe Lavagetto: role::deployment_server: generate dsh lists for zotero [puppet] - 10https://gerrit.wikimedia.org/r/353291 [14:02:41] <_joe_> mobrovac: ^^ [14:03:32] (03CR) 10jerkins-bot: [V: 04-1] role::deployment_server: generate dsh lists for zotero [puppet] - 10https://gerrit.wikimedia.org/r/353291 (owner: 10Giuseppe Lavagetto) [14:03:39] <_joe_> wat [14:04:29] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:05:13] (03PS2) 10Giuseppe Lavagetto: role::deployment_server: generate dsh lists for zotero [puppet] - 10https://gerrit.wikimedia.org/r/353291 [14:05:27] hm ok [14:05:30] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [14:06:09] <_joe_> mobrovac: 
this only generates the lists [14:06:11] <_joe_> nothing more [14:06:27] so they won't end up on tin? [14:07:29] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [14:07:29] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [14:07:54] !log deploying WDQS to fix T165029 [14:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:03] T165029: Examples are not displayed in the Query Service - https://phabricator.wikimedia.org/T165029 [14:08:05] (03PS3) 10Andrew Bogott: Nova policy: Open up quota-related queries [puppet] - 10https://gerrit.wikimedia.org/r/352606 (https://phabricator.wikimedia.org/T164332) [14:08:29] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:08:29] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [14:08:30] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:08:37] !log gehel@tin Started deploy [wdqs/wdqs@bc30531]: (no justification provided) [14:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:18] (03CR) 10Andrew Bogott: [C: 032] Nova policy: Open up quota-related queries [puppet] - 10https://gerrit.wikimedia.org/r/352606 (https://phabricator.wikimedia.org/T164332) (owner: 10Andrew Bogott) [14:10:01] !log gehel@tin Finished deploy [wdqs/wdqs@bc30531]: (no justification provided) (duration: 01m 23s) [14:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:39] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:12:29] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy 
[14:12:39] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:13:29] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [14:13:29] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:14:16] (03PS3) 10Filippo Giunchedi: logstash: build http_request from webrequest fields [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) [14:14:29] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [14:15:37] 06Operations, 07HHVM: HHVM 3.18 crash on job runner / luasandbox - https://phabricator.wikimedia.org/T165043#3255165 (10MoritzMuehlenhoff) [14:16:29] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [14:17:39] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:17:59] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:18:29] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:18:39] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [14:18:39] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:18:39] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:18:49] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [14:19:19] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:19:29] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:20:30] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [14:21:00] (03PS6) 10Ayounsi: Various LibreNMS improvements [puppet] - 10https://gerrit.wikimedia.org/r/353088 (https://phabricator.wikimedia.org/T164911) [14:21:39] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [14:21:39] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:22:39] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:22:39] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [14:22:39] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [14:22:39] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:23:29] <_joe_> mobrovac: should I restart zotero? [14:23:39] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [14:23:51] wtf? _joe_, i have just restarted it like 10 mins ago [14:23:54] what is going on? 
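The PROCS recoveries above come from an NRPE-driven process check that counts processes whose argument list matches an anchored regex (e.g. ^/usr/bin/python /usr/bin/salt-minion) and alerts on the count. A toy version of that matching step, assuming check_procs-style regex matching on the full command line and an assumed minimum-count threshold — not the real monitoring-plugins check_procs:

```python
import re

def check_procs(cmdlines, pattern, min_count=1):
    """Count processes whose command line matches the regex and report
    OK/CRITICAL on the count. Toy sketch of the PROCS checks above."""
    rx = re.compile(pattern)
    n = sum(1 for c in cmdlines if rx.search(c))
    return ("OK" if n >= min_count else "CRITICAL"), n
```

For a host running one salt-minion this yields the "PROCS OK: 1 process with regex args ..." shape seen in the recoveries.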
[14:24:06] <_joe_> no idea tbh [14:24:14] PROBLEM - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.16 and port 1969: No route to host [14:24:32] <_joe_> let's go take a look [14:24:33] <_joe_> sigh [14:24:39] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:24:44] _joe_: no cpu or mem utilisation by zotero [14:24:56] so network problems perhaps? [14:25:06] <_joe_> nope [14:25:11] <_joe_> no memory? [14:25:15] RECOVERY - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.009 second response time [14:25:18] <_joe_> it says 25% used [14:25:29] that was on sca2004 [14:25:35] sca2003 is much busier [14:26:08] it just died now? [14:26:10] you restarted it? [14:26:17] <_joe_> nope [14:26:49] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:26:49] <_joe_> oh my, logrotate doesn't work for zotero [14:27:20] <_joe_> so zotero on sca2003 seems to be working fine [14:27:36] if can be of any help LMK [14:27:49] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [14:28:03] * akosiaris looking as well [14:28:14] PROBLEM - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.16 and port 1969: No route to host [14:28:15] <_joe_> tbh, the logs on the two servers are different [14:28:40] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [14:28:41] <_joe_> akosiaris: No route to host [14:28:51] <_joe_> so yeah some kind of network problem I'd say [14:29:00] (03PS1) 10Ayounsi: Re-add mr1-esams v6 OOB IP [dns] - 10https://gerrit.wikimedia.org/r/353295 [14:29:03] I 've migrated sca2004 to a different network [14:29:11] hours ago but still [14:29:19] maybe pybal needs a refresh ? 
[14:29:26] sounds wrong though [14:29:34] <_joe_> oblivian@tegmen:~$ telnet 10.2.1.16 1969 [14:29:34] <_joe_> Trying 10.2.1.16... [14:29:34] <_joe_> telnet: Unable to connect to remote host: No route to host [14:29:35] (03CR) 10Ayounsi: [C: 032] Re-add mr1-esams v6 OOB IP [dns] - 10https://gerrit.wikimedia.org/r/353295 (owner: 10Ayounsi) [14:29:39] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:29:55] _joe_: with all endpoints depooled, that's expected, no ? [14:29:58] <_joe_> so for now lemme depool 2004 [14:30:04] <_joe_> akosiaris: not really [14:30:23] I had cumin failing to connect to sca2004.codfw.wmnet a minute ago [14:30:39] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) [14:30:49] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:31:11] lvs2003 does not have sca2004 pooled [14:31:14] RECOVERY - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.008 second response time [14:31:40] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [14:31:40] akosiaris: DNS cache [14:31:53] I can still get 10.192.16.30 sometimes for sca2004 [14:31:57] from sarin [14:31:57] <_joe_> akosiaris: interstingly, it doesn't see the PTR for sca2004 [14:32:03] volans: yeah sure, but LVS should not have that problem [14:32:05] <_joe_> that ^^ [14:32:16] <_joe_> akosiaris: lvs still has 10.192.16.30 [14:32:39] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [14:32:40] <_joe_> akosiaris: uhm let me try something [14:32:47] _joe_: hmmm depool threshold ? [14:32:59] or what ? 
it should have kicked it out long time ago [14:33:19] <_joe_> sca2004.codfw.wmnet: enabled/up/pooled [14:33:20] but at least that explains the no route to host [14:33:21] <_joe_> now [14:33:29] <_joe_> I think it has to do with the DNS change [14:33:41] <_joe_> I'm not sure but I remember a similar issue with pybal [14:34:15] I 'll restart pybal on lvs2006 [14:34:19] want to test something [14:34:33] (03CR) 10Paladox: "> E: Unable to locate package php7.0-gmp" [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [14:34:37] <_joe_> akosiaris: wait a sec [14:34:48] yeah that's it [14:34:49] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (Zotero alive) timed out before a response was received [14:34:52] now it's fine [14:35:00] so pybal resolves only once that DNS ? [14:35:09] <_joe_> akosiaris: let me try one thing [14:35:14] PROBLEM - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.16 and port 1969: No route to host [14:35:24] _joe_: yeah, but keep in mind I have already restart pybal on lvs2006 [14:35:37] <_joe_> akosiaris: this won't affect that [14:35:39] PROBLEM - swift-object-replicator on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:40] PROBLEM - dhclient process on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:40] PROBLEM - swift-account-server on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:40] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [14:35:49] PROBLEM - swift-container-replicator on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:49] PROBLEM - swift-object-server on ms-be2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:36:14] RECOVERY - LVS HTTP IPv4 on zotero.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.015 second response time [14:36:23] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=sca2004.codfw.wmnet [14:36:29] RECOVERY - swift-account-server on ms-be2002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:36:29] RECOVERY - swift-object-replicator on ms-be2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:36:29] RECOVERY - dhclient process on ms-be2002 is OK: PROCS OK: 0 processes with command name dhclient [14:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:32] I 'll silence this while we depool [14:36:34] debug* [14:36:40] RECOVERY - swift-container-replicator on ms-be2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:36:40] RECOVERY - swift-object-server on ms-be2002 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:37:12] 06Operations, 07HHVM: HHVM 3.18 crash on job runner / luasandbox - https://phabricator.wikimedia.org/T165043#3255256 (10MoritzMuehlenhoff) Here's the backtrace after installing the lua debug symbols: https://phabricator.wikimedia.org/P5423 It's unfortunate that our hhvm-luasandbox package strips debug symbols... 
[14:37:14] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: name=sca2004.codfw.wmnet [14:37:17] 06Operations, 10ChangeProp, 10ORES, 10Scoring-platform-team-Backlog, 10Traffic: [Discuss] Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#3255257 (10Halfak) [14:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:28] 06Operations, 10ChangeProp, 10ORES, 10Scoring-platform-team-Backlog, 10Traffic: [Discuss] Split ORES scores in datacenters based on wiki - https://phabricator.wikimedia.org/T164376#3231630 (10Halfak) p:05Triage>03Low [14:37:29] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [14:37:51] <_joe_> akosiaris: ok I set sca2004 to inactive, thus removing it from pybal's register [14:37:58] <_joe_> then I set it to pooled = yes [14:38:03] <_joe_> and that fixed the problem [14:38:14] has pybal some cache of IPs? [14:38:15] without a pybal restart ? nice [14:38:23] DNS -> IP ? clearly [14:38:30] <_joe_> yes [14:38:32] <_joe_> so basically [14:38:40] <_joe_> when you insert a server in the configuration [14:38:48] <_joe_> the dns lookup is performed [14:38:53] <_joe_> then it isn't done again [14:38:58] ok [14:39:13] <_joe_> for a good reason: when depooling to change IP, you should set it to inactive [14:39:25] <_joe_> so that pybal doesn't even try to contact the server in the meanwhile [14:39:30] inactive, not depooled ? [14:39:34] aah [14:39:37] ok that makes sense [14:39:39] <_joe_> inactive removes it from the config [14:39:46] <_joe_> depooled just removes it from the pool [14:39:49] cause with pooled=no it would still try to connect [14:39:49] ok [14:39:51] <_joe_> yes [14:40:08] and what happened that triggered all this ? zotero on sca2003 barfed ? [14:40:16] not withstanding the load or what ? 
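The pybal behaviour _joe_ and akosiaris work out above — the DNS lookup happens once, when a server is inserted into the configuration, and is never repeated, so a renumbered host keeps its stale IP until it is cycled through inactive (removed from the config) and back to pooled (re-inserted, hence re-resolved) — can be modelled like this. A simplified sketch with a fake resolver, not pybal's actual code; the post-renumber row-A address is invented for illustration:

```python
# Minimal model of a pool that resolves hostnames only at insertion time,
# as described for pybal above (illustrative; not pybal's real implementation).

DNS = {"sca2004.codfw.wmnet": "10.192.16.30"}  # pretend resolver

class Pool:
    def __init__(self):
        self.servers = {}  # host -> {"ip": ..., "pooled": ...}

    def insert(self, host, pooled=True):
        # The lookup happens here, once -- and is never repeated afterwards.
        self.servers[host] = {"ip": DNS[host], "pooled": pooled}

    def set_state(self, host, state):
        if state == "inactive":
            # inactive removes the server from the config entirely ...
            self.servers.pop(host, None)
        elif state == "pooled":
            # ... so re-pooling re-inserts it, which re-resolves the IP
            self.insert(host, pooled=True)
        elif state == "depooled":
            # depooled only takes it out of rotation; the cached IP survives
            self.servers[host]["pooled"] = False
```

This is why setting sca2004 to inactive and then back to pooled fixed the "No route to host" errors without a pybal restart, while a plain depool would not have.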
[14:40:17] <_joe_> no idea tbh [14:40:27] <_joe_> the load is purely our monitoring of citoid [14:40:29] <_joe_> :P [14:40:31] exactly [14:41:41] ok we should note this down if we haven't [14:41:49] I honestly did not remember that detail [14:42:03] so, all back to good now? can I get my turn at breaking things? :-P [14:42:15] (03PS7) 10Volans: Switch add_ip6_mapped to use interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [14:42:51] <_joe_> akosiaris: we have, but as you know that's not enough [14:43:05] volans: yeah, break stuff away [14:44:25] (03PS3) 10Giuseppe Lavagetto: role::deployment_server: generate dsh lists for zotero [puppet] - 10https://gerrit.wikimedia.org/r/353291 [14:44:27] ah, the beaty of zotero logs... no timestamp, useless empty lines, no level, no context [14:44:50] <_joe_> akosiaris: yeah I was appreciating that beauty too [14:44:59] but it's great software [14:45:08] <_joe_> clean the house man [14:45:13] :-) [14:53:45] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3255287 (10Papaul) a:05Papaul>03akosiaris Disk replacement complete. Bad disk return label attached. {F8037257} [14:54:05] 06Operations, 10ops-codfw: mw2098 failed to come up after reboot - https://phabricator.wikimedia.org/T164959#3255290 (10Papaul) p:05Triage>03Normal [14:54:30] 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3255291 (10Papaul) p:05Normal>03Low [14:54:32] (03CR) 10Volans: [C: 032] Switch add_ip6_mapped to use interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [14:56:32] 06Operations, 06Multimedia, 10media-storage, 15User-fgiunchedi: 404 error while accessing some images files (e.g. 
djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3255309 (10fgiunchedi) 05Open>03Resolved p:05High>03Normal a:03fgiunchedi We are rebalanci... [15:00:45] (03PS5) 10Volans: Remove c/p interface argument to add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/350768 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [15:14:33] 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3255384 (10Papaul) [15:14:59] 06Operations, 10ops-codfw, 10DBA, 10hardware-requests, 13Patch-For-Review: codfw: (1) spare pool system for temp allocation as database failover - https://phabricator.wikimedia.org/T161712#3255387 (10Papaul) [15:15:02] 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3236011 (10Papaul) 05Open>03Resolved Closing this task. Complete. [15:16:49] PROBLEM - mediawiki-installation DSH group on mw2146 is CRITICAL: Host mw2146 is not in mediawiki-installation dsh group [15:16:55] (03PS1) 10Catrope: Set oresDamagingPref default to values that actually exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353311 (https://phabricator.wikimedia.org/T165011) [15:17:14] 06Operations, 07HHVM: HHVM 3.18 crash on job runner / luasandbox - https://phabricator.wikimedia.org/T165043#3255412 (10Anomie) > It's unfortunate that our hhvm-luasandbox package strips debug symbols, we should probably add these to a separate package. According to https://wiki.debian.org/DebugPackage, that'... [15:18:01] Can I deploy this? 
[15:18:02] https://phabricator.wikimedia.org/T165011 [15:18:56] jouncebot: next [15:18:57] In 0 hour(s) and 41 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T1600) [15:18:59] jouncebot: now [15:18:59] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [15:19:11] Amir1: Is stuff pretty broken? [15:19:34] Reedy: you can't open preferences in wikidata [15:19:44] that would satisfy it? [15:19:49] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: decom mira - https://phabricator.wikimedia.org/T164588#3255421 (10Papaul) Disk wipe in progress [15:20:00] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3255422 (10akosiaris) Thanks!!! The controller started an automatic rebuild (since it's configured that way) and I see everything is proceeding fine. A bit slow, but fine ``` sudo megacli -PDRbld -ShowProg -PhysDrv [... [15:20:08] 06Operations, 07HHVM: HHVM 3.18 segfault on jobrunner / string handling - https://phabricator.wikimedia.org/T165051#3255423 (10MoritzMuehlenhoff) [15:20:11] (03CR) 10Ladsgroup: [C: 04-1] Set oresDamagingPref default to values that actually exist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353311 (https://phabricator.wikimedia.org/T165011) (owner: 10Catrope) [15:20:44] https://www.wikidata.org/wiki/Special:Preferences [15:21:37] (03CR) 10Catrope: Set oresDamagingPref default to values that actually exist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353311 (https://phabricator.wikimedia.org/T165011) (owner: 10Catrope) [15:21:58] _joe_ akosiaris ^ [15:22:03] sorry for pinging [15:22:27] <_joe_> Amir1: in meetings, what should I look at? 
[15:22:40] (03PS2) 10Catrope: Set oresDamagingPref default to values that actually exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353311 (https://phabricator.wikimedia.org/T165011) [15:22:46] _joe_: not much, When opening preferences in wikidata, it errors out [15:22:55] Can I deploy? [15:23:11] <_joe_> Amir1: I can't really look at it while in a meeting sorry [15:23:44] Amir1: fawiki => 'hard' ? [15:23:59] should it not be wikidata ? [15:24:00] akosiaris: I'm waiting for Roan to fix that [15:24:03] ah ok [15:24:09] I already made -1 for that [15:24:20] I believe it's actually right [15:24:26] but akosiaris, once it's fixed, can I deploy? [15:24:29] See comment on Gerrit, and amendment to commit msg [15:24:42] I see no reason not to if it's fixing something that's pretty broken [15:24:46] Thanks for taking care of the deployment Amir1 [15:24:48] in fact you should [15:24:55] I'm supposed to be on my way to the airport :D [15:25:09] RoanKattouw: It's possible to deploy from a plane [15:25:11] We all know this [15:25:13] RoanKattouw: no worries, thanks for fixing it [15:25:35] Reedy: I deployed in airport once, ores had three hours of full down time [15:25:39] (03CR) 10Volans: [C: 031] "LGTM, compiler is noop for all but dataset1001 that uses eth2 and the diff is only in the title of the resource:" [puppet] - 10https://gerrit.wikimedia.org/r/350768 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [15:26:00] akosiaris: thanks [15:26:15] (03CR) 10Ladsgroup: [C: 032] Set oresDamagingPref default to values that actually exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353311 (https://phabricator.wikimedia.org/T165011) (owner: 10Catrope) [15:27:00] Reedy: s/We all know this/It is known/ (preferably in Dothraki) [15:27:11] heh [15:27:47] No ferry rides planned for me today though [15:28:16] (03Merged) 10jenkins-bot: Set oresDamagingPref default to values that actually exist [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/353311 (https://phabricator.wikimedia.org/T165011) (owner: 10Catrope) [15:28:25] (03CR) 10jenkins-bot: Set oresDamagingPref default to values that actually exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353311 (https://phabricator.wikimedia.org/T165011) (owner: 10Catrope) [15:30:11] !log rotate novaadmin in /labtest/ ldappasswd -H ldap://labtestservices2001.wikimedia.org -x -D "uid=novaadmin,ou=people,dc=wikimedia,dc=org" -W -A -S [15:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:09] PROBLEM - swift-container-updater on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:31:09] PROBLEM - swift-account-reaper on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:31:09] PROBLEM - swift-container-server on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:31:19] PROBLEM - swift-object-replicator on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:31:19] PROBLEM - swift-object-server on ms-be2012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
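The swift PROBLEM/RECOVERY lines around this point come from an NRPE process-count check: it counts processes whose command line matches an anchored regex such as `^/usr/bin/python /usr/bin/swift-container-updater` and reports OK when the count is within thresholds. A minimal sketch of that matching (the sample process list is invented for illustration):

```python
import re

def count_matching(cmdlines, pattern):
    """Count command lines matching an anchored regex, as a
    check_procs-style 'regex args' check does."""
    rx = re.compile(pattern)
    return sum(1 for args in cmdlines if rx.search(args))

# Invented sample of process argument strings on a swift backend.
procs = [
    "/usr/bin/python /usr/bin/swift-container-updater",
    "/usr/bin/python /usr/bin/swift-object-server",
    "/usr/bin/python /usr/bin/swift-object-server",
    "/bin/sh -c /usr/bin/swift-container-updater",  # wrapper: ^ anchor fails
]

n = count_matching(procs, r"^/usr/bin/python /usr/bin/swift-container-updater")
status = "OK" if n >= 1 else "CRITICAL"
print(f"PROCS {status}: {n} process with regex args ...")
```

The CHECK_NRPE "Socket timeout" criticals above are a different failure mode: the check never ran because the NRPE daemon on the box was too slow to answer (here, likely due to disk I/O pressure), which is why the services all "recover" moments later.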
[15:32:00] RECOVERY - swift-container-updater on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:32:00] RECOVERY - swift-account-reaper on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:32:00] RECOVERY - swift-container-server on ms-be2012 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:32:09] RECOVERY - swift-object-replicator on ms-be2012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [15:32:09] RECOVERY - swift-object-server on ms-be2012 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:35:10] !log starts of ladsgroup@tin:/srv/mediawiki-staging$ scap sync-dir wmf-config 'Set oresDamagingPref default to values that actually exist (T165011)' [15:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:18] T165011: Global default 'hard' is invalid for field oresDamagingPref - https://phabricator.wikimedia.org/T165011 [15:35:38] !log ladsgroup@tin Synchronized wmf-config: Set oresDamagingPref default to values that actually exist (T165011) (duration: 00m 44s) [15:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:12] Thanks Amir1 [15:38:18] Did that fix it? 
[15:38:24] Yup [15:38:28] I'm closing the bug [15:38:31] Thank you [15:39:00] (03PS4) 10Filippo Giunchedi: logstash: build http_request from webrequest fields [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) [15:39:28] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3255501 (10RobH) [15:40:51] (03CR) 10Volans: [C: 032] Remove c/p interface argument to add_ip6_mapped [puppet] - 10https://gerrit.wikimedia.org/r/350768 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [15:41:10] 06Operations, 10ops-codfw: mw2098 failed to come up after reboot - https://phabricator.wikimedia.org/T164959#3255517 (10Papaul) This system is out of warranty and this issue happen now 5 times when after reboot the system doesn't come back up and we need to pull the power for a couple of minutes. Please see... [15:44:09] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [15:44:35] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3255527 (10RobH) An email has been sent to both UnitedLayer and CyrusOne. Equinix requires setup in their portal, where both esams and knams will be emails as well. [15:44:54] 06Operations, 07HHVM: HHVM 3.18 crash on job runner / luasandbox - https://phabricator.wikimedia.org/T165043#3255528 (10MoritzMuehlenhoff) I'll build an unstripped hhvm-luasandbox package tomorrow morning (can be done manually, we in fact don't have the automatic dbgsym packages from stretch) [15:45:14] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3255529 (10mark) Yeah, that makes sense. Approved. 
[15:47:55] (03PS3) 10Volans: Move all add_ip6_mapped calls to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/353095 (owner: 10Faidon Liambotis) [15:49:09] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:55:59] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [15:56:02] (03CR) 10DCausse: logstash: build http_request from webrequest fields (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [15:56:31] (03CR) 10Volans: [C: 031] "LGTM, they are all noop: https://puppet-compiler.wmflabs.org/6395/" [puppet] - 10https://gerrit.wikimedia.org/r/353095 (owner: 10Faidon Liambotis) [15:58:54] (03CR) 10Volans: [C: 032] Move all add_ip6_mapped calls to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/353095 (owner: 10Faidon Liambotis) [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T1600). [16:01:12] 06Operations, 07HHVM: HHVM 3.18 crash on job runner / luasandbox - https://phabricator.wikimedia.org/T165043#3255584 (10Anomie) Based on the surrounding functions in the stack trace, I'd guess frame #1 is most likely [[https://phabricator.wikimedia.org/diffusion/MLUS/browse/debian/alloc.c;b4a80e2af15de16b18abb... [16:02:05] nothing up for puppet swat [16:03:08] godog: --^ [16:04:05] https://giphy.com/gifs/dancing-happy-will-smith-bTzFnjHPuVvva [16:06:05] (03PS5) 10Volans: lvs: remove support for <= trusty [puppet] - 10https://gerrit.wikimedia.org/r/350769 (owner: 10Faidon Liambotis) [16:08:46] 06Operations, 10ops-codfw: mw2098 failed to come up after reboot - https://phabricator.wikimedia.org/T164959#3255621 (10Papaul) a:05Papaul>03MoritzMuehlenhoff @MoritzMuehlenhoff system is back up. 
when done can you please assign this task to @Robh or @Joe ? Thanks. [16:10:18] 06Operations, 10ops-codfw: mw2098 failed to come up after reboot - https://phabricator.wikimedia.org/T164959#3255625 (10RobH) I'd advise we decommission the host if its failing constantly, as its warranty ended on 2016-01-24. Since it is part of the mw cluster, that cluster has a greater number of hosts than... [16:10:45] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3255626 (10elukey) Compared the strace of two requests, one with Connection close and one without it. Something interesting came up: With Connection: close ``` [pid 4... [16:13:48] elukey: https://i.imgur.com/V83orFg.gifv [16:14:21] (03CR) 10Volans: [C: 031] "LGTM, they are all NOOP: https://puppet-compiler.wmflabs.org/6397/" [puppet] - 10https://gerrit.wikimedia.org/r/350769 (owner: 10Faidon Liambotis) [16:15:59] godog: you have a talent, I can't even try to replace you :D [16:16:05] 06Operations, 10Traffic: varnish frontend transient memory usage keeps growing - https://phabricator.wikimedia.org/T165063#3255683 (10ema) [16:16:38] (03CR) 10Filippo Giunchedi: logstash: build http_request from webrequest fields (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi) [16:17:29] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [16:18:04] elukey: haha don't sell yourself short! 
reddit is a trove for this sort of thing [16:18:08] (03CR) 10Volans: [C: 032] lvs: remove support for <= trusty [puppet] - 10https://gerrit.wikimedia.org/r/350769 (owner: 10Faidon Liambotis) [16:18:19] PROBLEM - mediawiki-installation DSH group on mw2098 is CRITICAL: Host mw2098 is not in mediawiki-installation dsh group [16:18:42] 06Operations, 10Traffic: varnish frontend transient memory usage keeps growing - https://phabricator.wikimedia.org/T165063#3255727 (10ema) p:05Triage>03High [16:18:47] (03PS5) 10Filippo Giunchedi: logstash: build http_request from webrequest fields [puppet] - 10https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) [16:23:58] 06Operations, 10Traffic: varnish frontend transient memory usage keeps growing - https://phabricator.wikimedia.org/T165063#3255772 (10ema) [16:28:38] PROBLEM - HHVM rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [16:28:48] PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [16:28:58] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.155 second response time [16:29:06] checking --^ [16:29:35] oh noes, this is 3.18 [16:29:38] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 74206 bytes in 0.241 second response time [16:29:39] moritzm: --^ [16:29:48] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.120 second response time [16:29:58] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.178 second response time [16:30:44] May 11 16:26:04 mw1261 hhvm: Core dumped: Segmentation fault [16:30:44] May 11 16:26:04 mw1261 hhvm: Stack trace in /var/log/hhvm/stacktrace.12256.log [16:30:50] /o\ [16:30:54] this one is an appserver [16:30:56]
elukey: He's filed a couple of bugs already... [16:30:58] So might be a dupe [16:31:21] (03CR) 10Dzahn: DHCP/partman: Add dhcp and partman entries for kubernetes200[1-4] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/353098 (owner: 10Papaul) [16:31:42] Reedy: yeah but not for appservers, we thought only jobrunners were affected :( [16:32:07] going afk, will check later [16:35:16] (03PS7) 10Volans: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:37:29] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [16:40:25] (03PS2) 10Dzahn: DHCP/partman: Add dhcp and partman entries for kubernetes200[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/353098 (owner: 10Papaul) [16:48:03] (03CR) 10Volans: [C: 031] "LGTM, compiler diffs have just the new parameter added." [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:48:03] Reedy, elukey: that was https://phabricator.wikimedia.org/T162586 [16:48:20] I need to follow up with some more information to the upstream bug [16:48:30] I did say it may be a dupe ;) [16:48:35] my bet it's also related to stat_cache .... [16:48:48] PROBLEM - swift-account-auditor on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:48:48] PROBLEM - swift-object-auditor on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:49:13] (03CR) 10Dzahn: [C: 032] "i took the liberty to just change netboot to "kubernetes[1-2]*". that will cover all the existing ones and new ones unless we really need " [puppet] - 10https://gerrit.wikimedia.org/r/353098 (owner: 10Papaul) [16:49:26] but usually it doesn't trip an Icinga alert since systemd restarts it quickly enough.
so it's not that part for now (at least the deadlock in HPHP::Treadmill is fixed which actually deadlocked and required manual fixups) [16:49:38] RECOVERY - swift-object-auditor on ms-be1020 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:49:38] RECOVERY - swift-account-auditor on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:49:47] (03PS3) 10Dzahn: DHCP/partman: Add dhcp and partman entries for kubernetes200[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/353098 (owner: 10Papaul) [16:53:08] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [16:53:16] (03PS1) 10Ayounsi: Add mr1-ulsfo v6 OOB IP [dns] - 10https://gerrit.wikimedia.org/r/353329 [16:53:55] (03CR) 10Ayounsi: [C: 032] Add mr1-ulsfo v6 OOB IP [dns] - 10https://gerrit.wikimedia.org/r/353329 (owner: 10Ayounsi) [16:54:28] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T1700). [17:01:36] godog: issues with swift? ^^^ [17:02:03] * volans would really like to see a link to a grafana dashboard for each grafana-based alarm [17:02:08] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [17:02:12] (03CR) 10Dzahn: [C: 04-1] "eh, but we didn't activate that ppa repo, or did we? and isn't that Ubuntu-only while contint is on Debian?"
[puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [17:02:40] no parsoid deploy today [17:02:42] (03CR) 10Paladox: "@Dzahn see https://github.com/wikimedia/puppet/commit/8ec74841f4d5c3ab9a19100749bcecad3aa5c3cc (i may have got it wrong (wrong name) it's " [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [17:03:57] (03CR) 10Dzahn: "aha! thanks for the link. well, i'll remove my -1 then but waiting for hashar to comment" [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [17:04:28] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [17:04:55] (03CR) 10Dzahn: "maybe the commit message / subject could clarify that all of this is "labs-only" and not prod." [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [17:05:00] volans: looks like a spike of uploads, I checked Special:Newfiles for commons and looks like it is legit, uploads from MET [17:05:16] ok [17:05:19] thanks for checking [17:05:37] np! 
good thing we've disabled two big prerenders today heh [17:06:28] * godog off [17:06:31] (03PS3) 10Paladox: Labs contint: Install php5-gmp and php7.0-gmp [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) [17:07:11] (03PS8) 10Volans: interface/lvs: add an $interface parameter, remove hardcoded eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350770 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [17:07:13] (03PS7) 10Volans: cache: use interface_primary instead of eth0 [puppet] - 10https://gerrit.wikimedia.org/r/350771 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [17:07:15] (03PS1) 10Volans: Interface: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/353332 (https://phabricator.wikimedia.org/T163196) [17:07:17] (03CR) 10Dzahn: "thanks! was totally on my list to do this now but volans you beat me to it :)" [puppet] - 10https://gerrit.wikimedia.org/r/353095 (owner: 10Faidon Liambotis) [17:08:26] mutante: it's part of a long list of patches I'm babysitting to prod, mostly noop but touching very delicate parts of the infrastructure :) [17:08:37] so they need to go in order [17:09:11] volans: yea, i figured that and saw the dependencies, that is exactly why i did not merge that yesterday but did only the part for deployment_server [17:09:20] volans: so thanks, i wanted to compile that right now :) [17:10:19] thank you :) [17:14:26] (03PS3) 10Dzahn: udp2log: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352998 [17:14:53] 06Operations, 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static and develop procedures to keep it maintained - https://phabricator.wikimedia.org/T163721#3255983 (10Andrew) 05Open>03Resolved a:03Andrew [17:15:09] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [17:18:19] 06Operations, 
07Documentation, 15User-Zppix: Update swat deployers documation - https://phabricator.wikimedia.org/T165069#3255999 (10Zppix) [17:18:28] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [17:19:41] 06Operations, 07Documentation, 15User-Zppix: Update swat deployers documentation - https://phabricator.wikimedia.org/T165069#3256017 (10Zppix) [17:20:38] (03CR) 10Dzahn: [C: 032] udp2log: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352998 (owner: 10Dzahn) [17:22:17] (03CR) 10Dzahn: "confirmed no-op on mwlog1001/2001" [puppet] - 10https://gerrit.wikimedia.org/r/352998 (owner: 10Dzahn) [17:27:20] 06Operations, 07Documentation, 15User-Zppix: Update swat deployers documentation - https://phabricator.wikimedia.org/T165069#3256048 (10Reedy) [17:42:45] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3256128 (10daniel) [17:45:08] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [17:49:35] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256158 (10greg) [17:50:38] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3256160 (10RobH) For clarity: I had gotten @mark's approval on this via IRC before emailing vendors! I've also taken this as an opportunity to audit the user lists for each of these v... 
[17:50:49] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3256161 (10RobH) [17:52:27] ACKNOWLEDGEMENT - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] andrew bogott This is probably due to a password change I'm looking at it. [17:52:53] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3256168 (10RobH) [17:53:50] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256111 (10Paladox) May be related T165043 [17:59:06] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256198 (10greg) Can someone create a simple repro case? Or at least a backtrace? [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T1800). [18:00:04] matt_flaschen and Niharika: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:02:38] (03PS1) 10Ladsgroup: Enable OOUI in EditPage for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353339 (https://phabricator.wikimedia.org/T162849) [18:02:55] I have another patch that I'm adding to SWAT in a moment [18:02:59] I hope that's fine for you [18:03:16] Present [18:03:23] I can SWAT [18:03:28] Amir1: sure [18:03:29] o/ [18:03:36] I'll also have a late SWAT in addition to the existing one, if there's room.
[18:04:19] Thanks, added [18:06:04] uhhh, seeing a lot of Notice: Undefined variable: wmgOresDefaultSensitivityLevel in /srv/mediawiki/wmf-config/CommonSettings.php on line 33351 [18:06:24] thcipriani: I'll get that fixed [18:06:48] It doesn't have any prod impact AFAICT [18:06:54] only logspam [18:08:06] thcipriani: Amir1: Late SWAT addition to unbreak the current wmf.1 blockers - https://gerrit.wikimedia.org/r/#/c/353335/ [18:08:31] Krinkle: okie doke [18:10:01] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351195 (https://phabricator.wikimedia.org/T165007) (owner: 10Niharika29) [18:12:41] matt_flaschen: your wikimediaevents change is live on mwdebug1002, check please [18:13:48] (03PS3) 10Thcipriani: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351195 (https://phabricator.wikimedia.org/T165007) (owner: 10Niharika29) [18:14:01] (03CR) 10Thcipriani: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351195 (https://phabricator.wikimedia.org/T165007) (owner: 10Niharika29) [18:14:03] Looking [18:14:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351195 (https://phabricator.wikimedia.org/T165007) (owner: 10Niharika29) [18:16:44] (03Merged) 10jenkins-bot: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351195 (https://phabricator.wikimedia.org/T165007) (owner: 10Niharika29) [18:16:53] (03CR) 10jenkins-bot: Enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351195 (https://phabricator.wikimedia.org/T165007) (owner: 10Niharika29) [18:18:04] Amir1: do you have a patch for the wmgOresDefaultSensitivityLevel thing? Why is that happening? [18:18:36] Not yet, I'm working on it.
Roan made the code so I need to talk to him before moving on [18:19:37] the biggest problem here is that CommonSettings.php is calling a variable that is in InitializeSettings.php [18:19:58] and it's not possible to move the main thing to InitializeSettings.php [18:20:06] Amir1: Roan's on a plane. You won't talk to him in the next 24 hours. [18:20:10] thcipriani, WikimediaEvents is good. [18:20:25] Amir1, he fixed the original bug, right? What's the current status? [18:20:29] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [18:20:33] matt_flaschen: ok, syncing [18:20:38] yeah, the original bug is fixed [18:20:52] but not in a good way, this logspam is result of that [18:21:37] I'm here [18:21:40] Amir1, sorry, didn't see that above. [18:22:12] RoanKattouw, can you take a look, and I'll try to get the fix in. I'm working on a change to gate saved filters behind a config (also per this SWAT), per discussion. [18:22:18] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [18:22:21] Why is there logspam? I defined the variable in IS and used it in CS, does that not work any more? [18:22:54] I can't get on a computer right now, I'm mobile only [18:23:06] On the plane but not leaving for a little bit [18:23:09] jouncebot: You were saying? :P [18:23:12] James_F: [18:23:17] RoanKattouw, okay, will take a look after the saved filters one. 
[18:23:32] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/extensions/WikimediaEvents/modules/ext.wikimediaEvents.recentChangesClicks.js: SWAT: [[gerrit:353211|RecentChangesClicks: Address minor performance concerns]] T158458 (duration: 00m 42s) [18:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:42] T158458: ERI Metrics: Measure click-through actions from RC page and create 'Productivity' baseline - https://phabricator.wikimedia.org/T158458 [18:23:49] Reedy: Roan-without-laptop isn't really speaking to Roan, just the offline quarter. ;-) [18:24:19] In the brave new world of extension registration, do wmg vars not work any more? [18:24:22] hrm, the error message just dropped off after that sync [18:24:36] thcipriani: I have seen that a lot recently [18:25:03] Needing to sync twice for interdependent IS/CS changes [18:25:11] RoanKattouw: I will take a look at scap logs after SWAT and try to figure out what's going on. [18:25:33] every sync without the --beta-only-update flag touches IS [18:25:38] Niharika had it happen to her twice [18:25:54] so that's probably what was needed here, dunno why yet. [18:26:13] Yeah. :( Glad to know it wasn't just me. [18:26:15] anyway, logs are happier for now. [18:26:20] She ran sync-dir wmf-config (I think?) and would get logspam [18:26:38] hrm, that may be the issue since rsync [18:26:44] Then synced one of the files (or both?) again and that'd fix it [18:26:54] (03PS1) 10Ladsgroup: Moving some ORES configs from InitialiseSettings to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353344 (https://phabricator.wikimedia.org/T165011) [18:27:07] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256310 (10Reedy) >>! In T165074#3256198, @greg wrote: > Can someone create a simple repro case? Or at least a backtrace? Do we have a jessie hh...
[18:27:07] sync-dir of wmf-config has no order guarantee [18:27:12] Re every sync touching IS, that's needed because of how MW's config caching works [18:27:27] Hmm I see [18:27:33] I synced the files individually but I synced CS before IS and then syncing CS once more would fix the logspam. [18:27:41] touch is second only to "have you tried turning it off and on again" [18:27:46] So probably my mistake. [18:27:48] Yeah, CS before IS would do that [18:28:02] But that should logspam only briefly [18:28:17] (03PS2) 10Dzahn: phabricator: switch to interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/350777 (owner: 10Faidon Liambotis) [18:28:23] In at least one of her cases it kept spamming for 5-10 mins after the second sync [18:28:33] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/353344/ [18:28:39] What do you think of this? [18:29:25] Niharika: LoginNotify on testwiki is live on mwdebug1002, check please [18:29:35] Amir1: Probably unnecessary if the logspam stays away [18:29:54] I'll defer to thcipriani though since I don't have much time left [18:30:07] Probably going to go off line in the next 5-10 mins [18:30:24] 4G works at over 10,000 feet [18:30:35] But not between 0 and FL100 [18:30:50] Yeah it does [18:31:01] Amir1: unneeded, logspam seems fine now [18:31:08] OK, OK, but let's not get Roan fined. :-) [18:31:14] thcipriani: I get "Error Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes. See the error message at the bottom of this page for more information."
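The failure mode being discussed here, CommonSettings.php (CS) reading a wmg variable that InitialiseSettings.php (IS) defines, so that syncing CS before IS leaves a window of "Undefined variable" notices, can be modeled roughly like this. The file names are real; everything else is a simulation, not scap's or MediaWiki's actual code:

```python
# Rough model of the CS-before-IS ordering problem: each request
# evaluates IS (which defines wmg* variables) and then CS (which reads
# some of them). If a new CS lands on a server before the matching IS,
# requests in between emit "Undefined variable" notices.

def serve_request(deployed):
    """Return the notices produced by evaluating IS then CS."""
    settings = dict(deployed["IS"])      # IS defines wmg* variables
    notices = []
    for var in deployed["CS"]:           # CS reads some of them
        if var not in settings:
            notices.append(f"Notice: Undefined variable: {var}")
    return notices

new_is = {"wmgOresDefaultSensitivityLevel": "soft"}  # value invented
new_cs = ["wmgOresDefaultSensitivityLevel"]

# Sync CS first: CS now reads a variable IS hasn't defined yet.
state = {"IS": {}, "CS": new_cs}
during = serve_request(state)

# Then sync IS: the notice goes away.
state["IS"] = new_is
after = serve_request(state)
print(during, after)
```

This also motivates why every sync touches IS (to invalidate the config cache): until the cache is refreshed with both halves of the change, stale settings can keep the notices flowing even after the second sync.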
[18:31:19] What would you do with angry pane crews [18:31:24] *plane [18:31:38] (03Abandoned) 10Ladsgroup: Moving some ORES configs from InitialiseSettings to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353344 (https://phabricator.wikimedia.org/T165011) (owner: 10Ladsgroup) [18:31:41] They're Canadian, they'll be polite ;) [18:32:00] (03CR) 10Dzahn: [C: 031] "amended to use interface::alias, but still needs Change-Id: I26a0f6d882fb25b first" [puppet] - 10https://gerrit.wikimedia.org/r/350777 (owner: 10Faidon Liambotis) [18:32:17] * Amir1 happy to hear it's not United [18:33:04] Niharika: hrm, did I miss a table creation somewhere? [18:34:00] (03PS2) 10Dzahn: Phabricator monthly email: Also include Differential user activity [puppet] - 10https://gerrit.wikimedia.org/r/348238 (owner: 10Aklapper) [18:34:02] thcipriani: I don't think so. LoginNotify doesn't have any tables of its own afaik. [18:34:20] (03CR) 10Dzahn: "unblocked now. mysql grants have been added." [puppet] - 10https://gerrit.wikimedia.org/r/348238 (owner: 10Aklapper) [18:35:11] Amir1: funny story, it was going to be United! But their flight was delayed so I'd miss my connection, and they moved me to Air Canada [18:35:28] ...who just gave me a group of 2 seats to myself, so I'm happy [18:35:46] :)))) [18:36:12] thcipriani: Niharika: Try searching logstash for host: mwdebug1002 perhaps? [18:36:14] that'd do it /srv/mediawiki/php-1.30.0-wmf.1/extensions/LoginNotify/extension.json does not exist! [18:36:19] ^ Niharika [18:36:36] Aha I see you got it [18:37:21] Is it branched? [18:37:37] Nope [18:37:47] (03PS1) 10Thcipriani: Revert "Enable LoginNotify on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353346 [18:37:54] I see. I didn't add it as an extension on prod. Just labs. [18:38:07] Sorry for the trouble thcipriani. 
[18:38:28] It's not in https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json :) [18:38:30] Niharika: np, just needs to be added to the tools/release repo for make-wmf-branch and we'll get it next train [18:39:00] !log Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the T132839 workarounds [18:39:06] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353346 (owner: 10Thcipriani) [18:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:08] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [18:39:23] Never seen that repo before. Is that list generated using wmf-config/extension-list? [18:39:43] Nope [18:39:48] Niharika: we just manually futz with that list [18:40:12] (03Merged) 10jenkins-bot: Revert "Enable LoginNotify on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353346 (owner: 10Thcipriani) [18:40:18] thcipriani: Then what's wmf-config/extension-list for? [18:40:24] (03CR) 10jenkins-bot: Revert "Enable LoginNotify on testwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353346 (owner: 10Thcipriani) [18:40:39] localisationupdate [18:40:52] Ah. 
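As the exchange above notes, the make-wmf-branch list and wmf-config/extension-list are maintained independently by hand, so an extension can be in one but not the other (the LoginNotify gap here). A small cross-check sketch; the JSON shape and paths are assumptions for illustration, not the files' real schema.

```python
import json

# Hand-maintained branch list (make-wmf-branch/config.json style, assumed
# shape) versus the deployed extension-list used by localisationupdate.
branch_config = json.loads('{"extensions": ["Gadgets", "WikimediaEvents"]}')
extension_list = [
    "extensions/Gadgets/extension.json",
    "extensions/LoginNotify/extension.json",
    "extensions/WikimediaEvents/extension.json",
]

branched = set(branch_config["extensions"])
deployed = {path.split("/")[1] for path in extension_list}

# Extensions with deployed config that would be missing from the next branch.
missing_from_branch = deployed - branched
```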
[18:41:15] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353339 (https://phabricator.wikimedia.org/T162849) (owner: 10Ladsgroup) [18:42:17] (03Merged) 10jenkins-bot: Enable OOUI in EditPage for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353339 (https://phabricator.wikimedia.org/T162849) (owner: 10Ladsgroup) [18:42:26] (03CR) 10jenkins-bot: Enable OOUI in EditPage for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353339 (https://phabricator.wikimedia.org/T162849) (owner: 10Ladsgroup) [18:43:10] Amir1: ^ is live on mwdebug1002, check please [18:43:34] on it [18:44:16] thcipriani: works just fine [18:44:23] Amir1: ok, going live [18:44:54] (03PS1) 10Mattflaschen: Enable saving RC Filters on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353349 [18:46:19] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3256370 (10aaron) 05Open>03declined >>! In T164173#3253516, @jcrespo wrote: > I th... [18:47:44] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:353339|Enable OOUI in EditPage for fawiki]] T162849 (duration: 00m 42s) [18:47:45] godog: https://gerrit.wikimedia.org/r/#/c/353173/ [18:47:50] ^ Amir1 live everywhere [18:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:51] T162849: Support WMF communities in run-up to switching EditPage over to OOUI - https://phabricator.wikimedia.org/T162849 [18:48:38] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:49:16] Krinkle: Gadgets change is live on mwdebug1002, check please [18:50:22] thcipriani: Verified using mw.org [18:50:34] and mwdebug1002 [18:50:35] Go ahead :) [18:50:42] ok, going live :) [18:52:20] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256387 (10Jdforrester-WMF) p:05Triage>03High This is at least High, as it's stopping merges into master in most repos. [18:52:21] thanks! [18:53:05] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/extensions/Gadgets/includes/GadgetResourceLoaderModule.php: SWAT: [[gerrit:353335|Revert "Move gadget styles from main stylesheet request to site request"]] T165040 T165031 (duration: 00m 42s) [18:53:11] ^ Krinkle live everywhere [18:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:14] T165040: MediaWiki:Common.css not applied if it uses '@import' rules and user has any style-only gadgets enabled (works when using ?debug=true and when disabling all gadgets) - https://phabricator.wikimedia.org/T165040 [18:53:14] T165031: Gadgets that use both scripts and styles, but do not specify type=general, are never loaded (JS file not loaded but CSS file is) - https://phabricator.wikimedia.org/T165031 [18:53:28] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [18:53:52] thcipriani: OK. verified again on mw.org, got the response from "wgHostname":"mw1241". Looks good. [18:54:00] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3256414 (10Cmjohnson) The new system board has been ordered and I will be contacted by a Dell tech to visit the cage and replace. In regards to Service Tag –... 
[18:54:04] cool, thanks for checking :) [18:54:18] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:56:32] (03PS1) 10Niharika29: Add loginnotify to extension-list for prod deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353352 [18:57:08] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [18:58:02] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256436 (10Paladox) [18:59:35] (03CR) 10Dzahn: [C: 032] Phabricator monthly email: Also include Differential user activity [puppet] - 10https://gerrit.wikimedia.org/r/348238 (owner: 10Aklapper) [19:00:05] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T1900). Please do the needful. [19:00:42] twentyafterfour, greg-g, et al, any word on Jenkins segfaults, e.g. T165064? [19:00:42] T165064: Segmentation fault in mwext-testextension-hhvm-composer-jessie builds - https://phabricator.wikimedia.org/T165064 [19:00:45] ^ moritzm [19:01:09] Not really [19:01:17] Trying to work out how to get a backtrace off one of the ci slaves [19:02:28] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [19:03:09] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [19:10:17] (03CR) 10Dzahn: "deployed and tested. 
"Active Differential users (any activity) in (2017-04): 23"" [puppet] - 10https://gerrit.wikimedia.org/r/348238 (owner: 10Aklapper) [19:14:55] (03PS1) 10Dzahn: piwik: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353354 [19:16:38] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:17:56] (03PS1) 1020after4: all wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353356 [19:17:58] (03CR) 1020after4: [C: 032] all wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353356 (owner: 1020after4) [19:18:25] (03PS1) 10Dzahn: ci::master: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353357 [19:20:32] (03PS1) 10Dzahn: dumps::zim: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353358 [19:20:50] (03Merged) 10jenkins-bot: all wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353356 (owner: 1020after4) [19:21:01] (03CR) 10jenkins-bot: all wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353356 (owner: 1020after4) [19:23:06] (03PS1) 10Dzahn: webperf: move 'standard' and 'base::firewall' to role [puppet] - 10https://gerrit.wikimedia.org/r/353359 [19:24:44] (03PS1) 10Dzahn: debug_proxy: move 'standard' and 'base::firewall' to role [puppet] - 10https://gerrit.wikimedia.org/r/353361 [19:26:42] (03PS1) 10Dzahn: backup: remove duplicate 'standard'-include [puppet] - 10https://gerrit.wikimedia.org/r/353362 [19:28:20] (03PS1) 10Dzahn: backup::offsite: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353363 [19:32:55] (03PS1) 10Dzahn: graphite: move 'standard' and 'base::firewall' to role [puppet] - 10https://gerrit.wikimedia.org/r/353364 [19:35:04] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.30.0-wmf.1 [19:35:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:48] (03PS1) 10Chad: Add a few more filetypes to cleanup script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353365 [19:38:34] (03CR) 10Chad: [C: 032] Add a few more filetypes to cleanup script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353365 (owner: 10Chad) [19:39:40] (03Merged) 10jenkins-bot: Add a few more filetypes to cleanup script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353365 (owner: 10Chad) [19:39:52] (03CR) 10jenkins-bot: Add a few more filetypes to cleanup script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353365 (owner: 10Chad) [19:41:08] !log demon@tin Synchronized scap/plugins/clean.py: no-op, completeness (duration: 00m 42s) [19:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:00] (03PS2) 10Chad: Scap prep: Save network time by copying data locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351356 [19:52:07] (03PS2) 10Niharika29: Deploy and enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353352 [19:52:28] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [19:55:08] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0] [19:59:48] (03PS3) 10Chad: Scap prep: Save network time by copying data locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351356 [20:02:28] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [20:03:08] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [20:03:26] (03CR) 10Chad: [C: 032] Scap prep: Save network time by copying data locally [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/351356 (owner: 10Chad) [20:04:33] (03CR) 10Reedy: [C: 04-1] "Two minor issues" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353352 (owner: 10Niharika29) [20:04:42] (03Merged) 10jenkins-bot: Scap prep: Save network time by copying data locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351356 (owner: 10Chad) [20:06:16] (03PS3) 10Niharika29: Deploy and enable LoginNotify on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353352 [20:06:19] !log demon@tin Synchronized scap/plugins/prep.py: scap prep is fast now (duration: 00m 44s) [20:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:50] (03CR) 10jenkins-bot: Scap prep: Save network time by copying data locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351356 (owner: 10Chad) [20:09:25] (03PS1) 10Chad: Scap clean: active_wikiversions() returns a dict, not a list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353381 [20:10:28] (03CR) 10Chad: [C: 032] Scap clean: active_wikiversions() returns a dict, not a list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353381 (owner: 10Chad) [20:10:45] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3256693 (10Tgr) We'll also need a way to display old versions of images. Clients can encounter old versions without expecting to due to FlaggedRevs hid... 
[20:11:41] (03Merged) 10jenkins-bot: Scap clean: active_wikiversions() returns a dict, not a list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353381 (owner: 10Chad) [20:11:49] (03CR) 10jenkins-bot: Scap clean: active_wikiversions() returns a dict, not a list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353381 (owner: 10Chad) [20:14:28] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [20:15:08] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0] [20:29:08] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:32:29] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [20:33:08] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [20:39:04] (03PS1) 10Dzahn: return HTTP 503 if database connection fails [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) [20:40:20] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256780 (10hashar) p:05High>03Unbreak! That is caused by the upgrade of HHVM {T158176}. 3.18 has been uploaded to apt.wikimedia.org under j... [20:42:21] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256789 (10hashar) The snapshots we have: | ID | Provider | Image | Hostname | Version | Image ID... 
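The clean.py fix above ("active_wikiversions() returns a dict, not a list") is the classic dict-vs-list pitfall; a minimal sketch, where the return shape is an assumption for illustration rather than scap's actual structure:

```python
# Assumed shape: version string -> list of wikis pinned to it.
active = {"php-1.30.0-wmf.1": ["enwiki", "fawiki"]}

# Buggy: code written against a list indexes positionally; on a dict the
# integer is treated as a key and raises KeyError.
try:
    newest = active[0]
except KeyError:
    newest = None

# Correct: treat it as a mapping and iterate its keys explicitly.
versions = sorted(active)  # iterating a dict yields its keys
```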
[20:44:40] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256794 (10greg) @MoritzMuehlenhoff we should probably downgrade the HHVM version from Beta and CI and work on repro'ing elsewhere. This is prev... [20:46:33] hey, is there any particular reason why labwiki, aka wikitech.wikimedia.org, isn't included in the dumps on https://wikitech.wikimedia.org ? [20:47:04] you mean on dumps.wikimedia.org? :P [20:47:10] probably because the db server is separate etc etc etc [20:47:13] apergos: ^ [20:48:09] Reedy: so we just don't dump wikitech? [20:48:17] yup [20:48:28] I think it's more of a don't, rather than a can't [20:48:32] huh. [20:48:49] i'd hope we still do backups of it, though? [20:48:59] https://wikitech-static.wikimedia.org/wiki/Main_Page [20:48:59] dump != our backups, and yes [20:49:11] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256803 (10hashar) Jessie instances are now being booted from `snapshot-ci-jessie-1494425642` which should have the previous HHVM version. What... [20:49:34] separate db so that wikitech is still available even if armageddon comes and all our main dbs go down? [20:50:18] actually it's "we can't" with the standard dumps process [20:50:21] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256804 (10Reedy) Ok, so a clean vagrant vm (with 4GB ram!), will segfault by running phpunit with no extensions From gdb attached... ``` Cont... [20:50:24] but it's dumped locally and we serve those [20:50:32] would it be impossible to get a one-off XML dump of current article contents?
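Getting a backtrace off a segfaulting HHVM, as Reedy does above with gdb attached, typically means running gdb in batch mode against the live process. A sketch that just assembles the command; the flags are standard gdb options, the pid is a placeholder, and this does not claim to be the exact invocation used here.

```python
# Build a batch-mode gdb invocation that attaches to a running process and
# dumps backtraces for every thread. In practice the pid would come from
# pidof/pgrep on the affected host.
def gdb_backtrace_cmd(pid):
    return [
        "gdb",
        "-p", str(pid),        # attach to the running process
        "-batch",              # run the -ex commands, then detach and exit
        "-ex", "thread apply all bt",
    ]

cmd = gdb_backtrace_cmd(12345)
# e.g. subprocess.run(cmd) on the host, with ptrace permitted
```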
[20:50:39] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256805 (10MoritzMuehlenhoff) We can't easily downgrade the HHVM package in the main repo, it's otherwise working fine in production and running... [20:50:46] apergos: what does "dumped locally" mean? [20:51:20] on the server itself [20:51:22] https://dumps.wikimedia.org/other/wikitech/dumps/ [20:51:29] i'm trying to do an all-wiki grep for some problematic wikitext markup that needs to be fixed, and code fragments tend to have it, unfortunately. so i'd like to be able to fixup wikitech at the same time as all the other wikis. [20:51:31] there's a job that runs. [20:51:51] https://dumps.wikimedia.org/other/wikitech/dumps/ will probably do, thanks. [20:52:25] "making wikitech more like any other cluster wiki" is a thing in general, so adding it = good [20:52:37] yes it would be nice [20:53:32] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256807 (10greg) We should really use the new HHVM in testing first before going to production. If the tests are broken it means fix the tests/t... [20:54:19] there's a permissions issue with the way it's set up now [20:54:32] I no longer remember if it's the db user/password but I expect so [20:56:15] 06Operations, 10DBA, 13Patch-For-Review: dbtree: don't return 200 on error pages - https://phabricator.wikimedia.org/T163143#3256809 (10Dzahn) How about [[ https://gerrit.wikimedia.org/r/#/c/353388/1/index.php | this ]]? [20:56:19] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256810 (10hashar) p:05Unbreak!>03High a:03hashar CI instances have been rollbacked to the last known snapshot which uses HHVM 3.12.14. I... 
[20:56:58] nowadays mysql grants are in the puppet repo [20:57:38] We're closer to wikitech being a cluster wiki than last week :) [20:57:55] !log CI Phpunit jobs were segfaulting due to an upgrade of HHVM to 3.18. Got rolled back to 3.12 - T165074 [20:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:03] T165074: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074 [20:58:08] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:58:37] the wikiadmin password itself may be different [20:58:44] I would expect it to be, tbh [20:59:09] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256816 (10MoritzMuehlenhoff) The new HHVM version has been extensively tested on five canary servers in production for 5-6 week now. As per Ree... [21:00:05] Amir1: Respected human, time to deploy Clean up ores_classification table (again) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T2100). Please do the needful. 
[21:00:12] b19bfd9e243b515423743b3b3a2ea3f5df2b8df2 says no permissions from snapshot hosts for labswiki, labstestwiki [21:00:12] oh yeah [21:00:17] so that's how it was a year ago [21:02:32] !log start of cleaning up ores_classification in enwiki for two hours (T159753) [21:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:42] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [21:15:18] (03CR) 10Hashar: [C: 031] "MediaWiki core UIDGenerator uses gmp if available whenever running on a non 64 bits PHP:" [puppet] - 10https://gerrit.wikimedia.org/r/353194 (https://phabricator.wikimedia.org/T164977) (owner: 10Paladox) [21:19:30] 06Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3256864 (10Dzahn) status update: nowadays terbium and wasat use the identical role and profile in site.pp, as in: ``` 2600 # mediawiki maintenance servers (https://wik... [21:19:42] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.1/includes/specials/SpecialSearch.php: hotfix T165091 (duration: 00m 39s) [21:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:50] T165091: Call to a member function hasInterwikiResults() on a non-object (null) - https://phabricator.wikimedia.org/T165091 [21:22:01] 06Operations, 10Continuous-Integration-Infrastructure, 10Wikidata, 07HHVM, 07Jenkins: CI tests failing with segfault - https://phabricator.wikimedia.org/T165074#3256869 (10hashar) Might be {T156923} surfacing again which mentionned XMLReader. 
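A two-hour, multi-million-row cleanup like the ores_classification purge above is normally done in small delete batches so replication lag stays bounded. An illustrative sketch only: SQLite stands in for MariaDB, and the table name, column, and cutoff are assumptions, not Amir's actual script.

```python
import sqlite3

# Toy table: 1000 rows, of which the 500 with oc_rev_id < 500 are "old"
# rows to purge, deleted in bounded batches rather than one huge DELETE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ores_classification (oc_rev_id INTEGER)")
conn.executemany("INSERT INTO ores_classification VALUES (?)",
                 [(i,) for i in range(1000)])

BATCH = 100
deleted = 0
while True:
    cur = conn.execute(
        "DELETE FROM ores_classification WHERE rowid IN ("
        "SELECT rowid FROM ores_classification "
        "WHERE oc_rev_id < 500 LIMIT ?)", (BATCH,))
    if cur.rowcount == 0:
        break
    deleted += cur.rowcount
    # On a production replica set, this is where one would wait for
    # replication lag to recover before the next batch.

remaining = conn.execute(
    "SELECT COUNT(*) FROM ores_classification").fetchone()[0]
```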
[21:22:11] 06Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3256871 (10Dzahn) reason: `database connection to tendril on tendril-backend.eqiad.wmnet failed` [21:24:39] https://phabricator.wikimedia.org/T165100 [21:24:44] Wikimedia\Rdbms\Database::makeList: empty input for field rev_id [21:25:13] kinda generic, I'm seeing quite a few of these though ^ [21:35:56] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3256951 (10Krinkle) 05Open>03Resolved a:03aaron [21:36:01] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3256953 (10Krinkle) [21:45:54] Person reporting they are observing high latency for reaching Wikipedia https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=9992918 [21:49:19] From where? [21:49:23] Traceroute provided? [21:52:58] umm...I can't disclose per NDA (and my limited technical knowledge) - the ticket includes trace log, traceroute, forward path, IPs, etc. but I can't disclose unless to another OTRS agent etc... [21:54:04] I don't have otrs access anymore for some reason [21:54:17] did you go inactive? [21:54:21] Probably [21:54:34] Josve05a: and AFAIK, anyone under a WMF NDA... 
Should be able to view the information [21:54:49] You're gonna have to put it somewhere else (like a NDA'd phab ticket) to get opsen to look at it [21:55:22] Josve05a, ticket customer knows what they're talking about, forward to ops [21:57:04] Will create phab ticket [21:57:31] Josve05a: Create it as a security ticket [21:57:37] using the security form [22:00:24] * Josve05a have no idea which team should have it https://phabricator.wikimedia.org/T165103 [22:00:36] stick netops on there [22:01:03] ty [22:01:05] Sounds like that's someone from an actual ISP [22:03:48] 06Operations, 10Page-Previews, 06Performance-Team, 06Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3257036 (10Tbayer) Thanks @Gilles! Speaking for myself, I also found yesterday's meeting really useful to better understa... [22:05:23] Could be worth telling them to get onto phab too, maybe [22:06:28] Josve05a: It'd be worth asking them when they noticed it started. And if they know what it was previously [22:07:22] Reedy: Well, since it is a security task, they would have to create an account, tell their name, be added as a subscriber, before they could even see the ticket [22:07:38] Yup [22:07:46] Which has been done a few times [22:08:43] yeah, but seems like a pain... [22:10:56] mutante: I'd like https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:errorpage to land this week if possible, there are other changes principally blocked by it. [22:11:01] Is there a good way to test it somehow? [22:11:05] I suppose in beta, if we want.
[22:11:09] Josve05a: Well, that or some contact details would help [22:11:12] I'm not sure what the process is for varnish patches like th ese [22:13:37] 06Operations, 10Traffic: AS43821 contact details not as up to date as AS14907 - https://phabricator.wikimedia.org/T165104#3257060 (10Reedy) [22:14:29] 06Operations, 10Traffic: AS43821 contact details not as up to date or as detailed as AS14907 - https://phabricator.wikimedia.org/T165104#3257074 (10Reedy) [22:16:04] 06Operations, 10Traffic, 10netops: AS43821 contact details not as up to date or as detailed as AS14907 - https://phabricator.wikimedia.org/T165104#3257060 (10Reedy) [22:16:23] (03PS2) 10Krinkle: varnish: Make errorpage.html balanced and use placeholder [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [22:16:32] (03PS3) 10Krinkle: varnish: Convert errorpage into re-usable template [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) [22:17:26] (03PS6) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) [22:24:14] 06Operations, 06DC-Ops, 10netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3257116 (10ayounsi) RMA# R200124729 [22:35:48] 06Operations, 06DC-Ops, 10netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3257123 (10RobH) Juniper emailed us the tracking info, and I've opened an inbound shipment ticket with unitedlayer. I'll plan to go onsite next Wednesday and swap them. [22:47:04] 06Operations, 10Traffic, 10netops: AS43821 contact details not as up to date or as detailed as AS14907 - https://phabricator.wikimedia.org/T165104#3257146 (10faidon) 05Open>03Invalid Just different databases (ARIn/RIPE) with different anti-spam measures. 
Nothing we can do about it :) [22:56:06] (03PS1) 10Jforrester: Enable OOUI for EditPage on MW.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353480 [22:56:58] 06Operations, 10Ops-Access-Requests: add Arzhel Younsi to datacenter access lists - https://phabricator.wikimedia.org/T165054#3257182 (10ayounsi) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170511T2300). [23:00:04] Krinkle, James_F, and mooeypoo: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:16] * James_F waves. [23:01:27] * mooeypoo waves from right to left [23:04:17] I can SWAT [23:05:25] Krinkle: ping me when you're around for SWAT [23:05:33] * Krinkle is here [23:05:38] thcipriani: :) [23:06:12] okie doke :) [23:07:06] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353480 (owner: 10Jforrester) [23:08:11] (03Merged) 10jenkins-bot: Enable OOUI for EditPage on MW.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353480 (owner: 10Jforrester) [23:08:20] (03CR) 10jenkins-bot: Enable OOUI for EditPage on MW.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353480 (owner: 10Jforrester) [23:09:11] !log clean up for ores_classification is finished for now, 9M rows cleaned, current number of row: 55,959,017 (T159753) [23:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:20] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [23:09:51] James_F: OOUI for EditPage on MW live on mwdebug1002, check please [23:10:58] thcipriani: Yup, LGTM. 
[23:11:04] ok, going live [23:12:46] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:353480|Enable OOUI for EditPage on MW.org]] (duration: 00m 40s) [23:12:51] ^ James_F live now [23:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:30] thcipriani: Double-checked, looks good. [23:17:15] James_F: mw.Upload.Dialog: Define .static.name live on mwdebug1002, check please [23:19:26] thcipriani: Yup, works. [23:21:12] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/resources/src/mediawiki/mediawiki.Upload.Dialog.js: SWAT: [[gerrit:353475|mw.Upload.Dialog: Define .static.name]] T164999 (duration: 00m 40s) [23:21:18] ^ James_F live now [23:21:23] thcipriani: Thanks! [23:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:29] T164999: Cross-wiki media upload tool throws JavaScript error - https://phabricator.wikimedia.org/T164999 [23:22:18] Krinkle: TemplateData change is live on mwdebug1002, check please [23:23:40] !log restart apache on iridium to apply hotfix for T163967 [23:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:48] T163967: Phabricator mouseover popups transparent and hence unreadable for closed tasks - https://phabricator.wikimedia.org/T163967 [23:28:31] thcipriani: Verified. 
[23:28:41] ok, going live [23:30:51] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/extensions/TemplateData/extension.json: SWAT: [[gerrit:353343|Fix styles queue violation for "ext.templateData"]] T92459 (duration: 00m 39s) [23:30:57] ^ Krinkle live everywhere [23:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:59] T92459: ResourceLoader should restrict addModuleStyles() to modules that only provide styles - https://phabricator.wikimedia.org/T92459 [23:32:29] mooeypoo: Gate option to save RC filters to default false is live on mwdebug1002, check please [23:32:44] Checking [23:33:54] thcipriani, works! [23:34:56] alright, going live [23:38:12] woot [23:38:21] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/includes/DefaultSettings.php: SWAT: [[gerrit:353479|Gate option to save RC filters to default false]] 1/3 (duration: 00m 39s) [23:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:09] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/includes/specials/SpecialRecentchanges.php: SWAT: [[gerrit:353479|Gate option to save RC filters to default false]] 2/3 (duration: 00m 39s) [23:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:57] !log thcipriani@tin Synchronized php-1.30.0-wmf.1/resources/src/mediawiki.rcfilters: SWAT: [[gerrit:353479|Gate option to save RC filters to default false]] 3/3 (duration: 00m 39s) [23:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:04] ^ mooeypoo all live now! [23:40:12] * mooeypoo double checks [23:41:40] All good, thanks! [23:41:51] nice, thanks for double-checking :) [23:42:53] All good here too. [23:42:54] o/ [23:43:48] \o high-five [23:44:55] thcipriani, I am sorry, apparently this needs to be +2'ed and deployed too, even though it's config? https://gerrit.wikimedia.org/r/#/c/353349/1 [23:45:24] * thcipriani looks [23:45:26] That one is just for Beta Cluster. 
[23:45:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353349 (owner: 10Mattflaschen) [23:45:50] \o/ [23:46:02] ^ That's the combination of both high-fives from before [23:46:11] Thanks, thcipriani, mooeypoo. [23:46:29] thank you matt_flaschen ! [23:46:59] (03Merged) 10jenkins-bot: Enable saving RC Filters on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353349 (owner: 10Mattflaschen) [23:48:06] (03CR) 10jenkins-bot: Enable saving RC Filters on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353349 (owner: 10Mattflaschen) [23:49:08] mooeypoo: that one should be live on beta the next time beta-scap-eqiad runs which should be shortly [23:49:18] * mooeypoo nods [23:49:19] !log thcipriani@tin Synchronized wmf-config/CommonSettings-labs.php: SWAT: [[gerrit:353349|Enable saving RC Filters on Beta Cluster]] (beta-only-change) (duration: 00m 39s) [23:49:26] thanks thcipriani [23:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:36] yw :) [23:57:09] PROBLEM - Nginx local proxy to apache on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:18] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:38] PROBLEM - Apache HTTP on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:58] RECOVERY - Nginx local proxy to apache on mw1293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.191 second response time [23:58:08] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 74667 bytes in 0.256 second response time [23:58:28] RECOVERY - Apache HTTP on mw1293 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.124 second response time