[00:00:03] (03PS1) 10Platonides: Use https:// urls when communicating with PediaPress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336340 (https://phabricator.wikimedia.org/T157398) [00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170207T0000). Please do the needful. [00:01:56] (03CR) 10Dzahn: [C: 031] Use https:// urls when communicating with PediaPress [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336340 (https://phabricator.wikimedia.org/T157398) (owner: 10Platonides) [00:03:17] (03CR) 10Dzahn: [C: 031] "but does pediapress have to fix their cert for tools.pp or does it not matter for this change?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336340 (https://phabricator.wikimedia.org/T157398) (owner: 10Platonides) [00:03:43] 06Operations, 10Collection, 10Traffic, 07HTTPS, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3003811 (10Dzahn) So does pediapress have to fix their cert on tools.pediapress before this can be merged or is it unrelated to that cha... [00:06:02] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:08:29] (03CR) 10Dduvall: Gemfile: add xmlrpc for ruby 2.4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar) [00:13:08] Platonides: ping? [00:14:12] Platonides: I'd suggest we'd deploy * [config] {{Gerrit|336330}} Show again svwiki logo between 1.5x and 2x zoom ({{phabT|157387}}) [00:17:31] Josve05a: ping? [00:17:47] pong... [00:17:51] did I break anything?
[00:18:15] no, but you can help to test svwiki high resolution logos [00:18:31] the bug you reported at https://phabricator.wikimedia.org/T157387 [00:18:53] ACKNOWLEDGEMENT - puppet last run on copper is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://gerrit.wikimedia.org/r/#/c/335299/ [00:19:05] Not sure what I can help with, but sure [00:20:07] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336330 (https://phabricator.wikimedia.org/T157387) (owner: 10Platonides) [00:21:02] Added to deployments table [00:21:46] Josve05a: do you have https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug installed? [00:21:53] (03Merged) 10jenkins-bot: Show again svwiki logo between 1.5x and 2x zoom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336330 (https://phabricator.wikimedia.org/T157387) (owner: 10Platonides) [00:22:01] (03CR) 10jenkins-bot: Show again svwiki logo between 1.5x and 2x zoom [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336330 (https://phabricator.wikimedia.org/T157387) (owner: 10Platonides) [00:22:26] Dereckson: no, not yet [00:23:05] Josve05a: okay those two are extensions to inject specialized headers to tell the front-end to send the request to a specific debug server [00:23:47] Josve05a: install the extension, then on the menu choose mwdebug1002, put the trigger at on and visit sv.wikipedia.org, then check if the logo looks good to you in high resolution [00:24:34] yes, looks good to me [00:24:42] awesome, thanks for checking [00:25:51] Syncing. 
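The debug routing described at 00:23:05 and 00:23:47 can also be approximated without the browser extension by sending the header by hand. This is a minimal sketch, not the extension's actual behaviour: it assumes the header is named `X-Wikimedia-Debug` and takes a `backend=<host>` value, and that the debug host's full name is `mwdebug1002.eqiad.wmnet`. Neither the value format nor the domain suffix appears in the log, so check the wikitech page linked above before relying on them.

```python
import urllib.request

# Assumption: the header name and "backend=<host>" value format are not
# stated in the log; they are illustrative only.
DEBUG_HEADER = "X-Wikimedia-Debug"


def debug_headers(backend: str) -> dict:
    """Return the extra request headers that ask for a specific debug backend."""
    return {DEBUG_HEADER: f"backend={backend}"}


def build_debug_request(url: str, backend: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request carrying the debug header."""
    return urllib.request.Request(url, headers=debug_headers(backend))


if __name__ == "__main__":
    # Hypothetical fully qualified host name for the mwdebug1002 server.
    request = build_debug_request(
        "https://sv.wikipedia.org/", "mwdebug1002.eqiad.wmnet"
    )
    # Nothing is sent over the network here; we only show the prepared URL.
    print(request.full_url)
```

As with the extension trigger, a script like this should only be used for the one test request, since every request carrying the header lands in the debug server's logs.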
[00:26:23] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Show again svwiki logo between 1.5x and 2x zoom (T157387) (duration: 00m 40s) [00:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:26:27] T157387: Logo missing on Swedish Wikipedia (sv.wikipedia) on certain zoom ranges/screen resolution - https://phabricator.wikimedia.org/T157387 [00:27:58] * Josve05a is reading https://wikitech.wikimedia.org/wiki/Volunteer_NDA ... [00:28:00] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3003921 (10Dzahn) a:05Dzahn>03None [00:28:23] Josve05a: oh by the way, don't forget to put the trigger at off now, if not, you'll flood the logs of the test server with every request you do [00:29:09] oh...I've turned it off and on (2-3 times now) to test one small thing...it's off now (I hope) [00:29:42] bblack: hi! Do the URLs Varnish uses internally as cache keys include protocol? Does purging http://page?foo=bla also purge https://page?foo=bla? [00:31:43] !log install1001 - re-enabled puppet, start DHCP service [00:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:10] AndyRussG: in almost all cases, http://foo/bar is a 301 ti https://foo/bar [00:37:13] s/ti/to/ [00:37:48] so it's kind of a non-question, but technically many things Vary (as in the header) on protocol, so technically they are separate, and you're probably looking for the https one [00:39:35] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3003994 (10JbuattiWMF) Hi @Dzahn (again), Could we add three more accounts to this LDAP group? RStallman is our long-time paralegal, and the other two are legal fellows. All th... 
[00:39:51] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3003995 (10JbuattiWMF) 05Resolved>03Open [00:40:11] bblack: hmmm K [00:41:35] bblack: basically, if I were to use the Title object to get URLs to purge... Do you know if getInternalURL() is the right method? https://doc.wikimedia.org/mediawiki-core/master/php/classTitle.html#a7f5158838132cde58f6837958bdb3761 [00:41:41] Appears so from other code [00:42:47] The dilemma I have is that the actual source for the URL that I need to purge is a global config variable with the full URL minus the params and the protocol [00:44:24] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L1637-L1643 [00:45:02] (03PS1) 10Dereckson: Fix $wmgVisualEditorAvailableNamespaces code style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336346 [00:45:04] (03PS1) 10Dereckson: Enable VE on fr.wiktionary Projet: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336347 (https://phabricator.wikimedia.org/T156660) [00:45:17] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336346 (owner: 10Dereckson) [00:47:51] (03Merged) 10jenkins-bot: Fix $wmgVisualEditorAvailableNamespaces code style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336346 (owner: 10Dereckson) [00:48:00] (03CR) 10jenkins-bot: Fix $wmgVisualEditorAvailableNamespaces code style [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336346 (owner: 10Dereckson) [00:48:02] on mwdebug1002 [00:48:55] works [00:49:30] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Fix $wmgVisualEditorAvailableNamespaces code style ([[Gerrit:336346]]) (no-op) (duration: 00m 40s) [00:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:57] SWAT done. 
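The purge question above (a config value holding the URL without protocol or parameters, and the note at 00:37:48 that the https variant is probably the one wanted) can be illustrated with a small sketch. It is written in Python for illustration only and is not the MediaWiki `Title` API. The host `example.org`, the function name `purge_url`, and the accepted input forms (`//host/path` or `host/path`) are assumptions, and the mobile-URL hook mentioned later in the log is not covered.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def purge_url(base: str, params: dict, scheme: str = "https") -> str:
    """Build an absolute URL to purge from a scheme-less base and query params.

    `base` may be protocol-relative ("//host/path") or bare ("host/path").
    Any scheme already present is replaced by `scheme`.
    """
    if "://" not in base:
        # Normalise to protocol-relative form so urlsplit sees a netloc.
        base = "//" + base.lstrip("/")
    parts = urlsplit(base)
    # Force the wanted scheme and attach the encoded parameters.
    parts = parts._replace(scheme=scheme, query=urlencode(params))
    return urlunsplit(parts)


if __name__ == "__main__":
    # Hypothetical config value and parameters.
    print(purge_url("//example.org/w/index.php", {"foo": "bla"}))
```

Since the http and https variants can be cached separately, a caller wanting to be thorough could call the function once per scheme and purge both results.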
[00:53:29] AndyRussG: I really don't know much about MW, sorry [00:53:44] AndyRussG: especially for broader re-useable code outside of here :) [00:54:04] bblack: K thanks... sorry for the bother..... [00:54:05] AndyRussG: but here, it's safe to assume https:// is the right protocol [00:54:14] K [00:55:29] I could just hack things up and add the protocol manually, but it just doesn't seem like the right way to go... Especially since, to get the equivalent mobile URL, I have to run a MW hook that requires a Title object as a param [00:55:37] 06Operations, 10Stashbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#3004037 (10bd808) Stashbot has a cloak, so it is less likely to get kicked by banbot too I think. [00:55:42] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service, 10Revision-Scoring-As-A-Service-Backlog, and 4 others: ORES Overloaded (particularly 2017-02-05 02:25-02:30) - https://phabricator.wikimedia.org/T157206#3004038 (10Aklapper) [00:55:49] I guess I should look who are the authors of the relevant bits of MW code [00:56:55] SAL restored [00:59:51] 06Operations, 06Labs, 10Stashbot: Make morebots run on a production host - https://phabricator.wikimedia.org/T94638#3004040 (10bd808) Stashbot could be a candidate for running on the planned production Kubernetes cluster. The code would still need a security audit I'm sure. [01:11:02] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:13:40] (03PS1) 10Tim Landscheidt: Tools: Do not install adminbot [puppet] - 10https://gerrit.wikimedia.org/r/336351 (https://phabricator.wikimedia.org/T157400) [01:14:04] (03CR) 10Tim Landscheidt: [C: 04-1] "Pending T157399." 
[puppet] - 10https://gerrit.wikimedia.org/r/336351 (https://phabricator.wikimedia.org/T157400) (owner: 10Tim Landscheidt) [01:20:43] Dereckson: I archived a bunch of the SAL stuff -- https://wikitech.wikimedia.org/wiki/Server_admin_log/Archive_31 [01:31:11] !log prometheus1004 - installed OS, signing puppet cert, initial run.. (T152504) [01:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:17] T152504: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504 [01:34:52] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1805.434659 Seconds [01:34:52] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1807.562841 Seconds [01:35:12] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1824.483888 Seconds [01:35:52] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 17.379532 Seconds [01:35:52] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 19.484103 Seconds [01:36:12] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 36.413696 Seconds [01:36:50] (03CR) 10Dzahn: [C: 032] "checked if addresses are valid and deliverable" [puppet] - 10https://gerrit.wikimedia.org/r/336222 (owner: 10Muehlenhoff) [01:37:01] (03PS2) 10Dzahn: First batch of entries for LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/336222 (owner: 10Muehlenhoff) [01:40:02] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [01:40:05] (03CR) 10Nuria: "Ah, sorry. 
Abandoning, will test on EL server as this package should already be present then" [puppet] - 10https://gerrit.wikimedia.org/r/335854 (https://phabricator.wikimedia.org/T153207) (owner: 10Nuria) [01:42:43] (03CR) 10Tim Landscheidt: Generate man page for collector-runner (033 comments) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [01:46:11] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3004112 (10Dzahn) Rob fixed the switch port config, then i could install prometheus1004 as well. It's done and sitting at login like 1003, but without a specific role so far. [01:47:44] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3004113 (10Dzahn) [01:48:09] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2850801 (10Dzahn) [01:48:37] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2850801 (10Dzahn) a:03fgiunchedi [01:49:54] (03PS1) 10Dzahn: add prometheus1003/1004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/336354 (https://phabricator.wikimedia.org/T152504) [01:53:04] (03PS3) 10Tim Landscheidt: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) [01:55:00] (03PS1) 10Dzahn: wmflib/tests: replace install1001 with install1002 [puppet] - 10https://gerrit.wikimedia.org/r/336355 [01:57:27] (03PS4) 10Tim Landscheidt: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) [01:58:11] (03PS1) 10Dzahn: openstack: switch installserver to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/336356 [02:04:47] 
PROBLEM - configured eth on prometheus1004 is CRITICAL: Return code of 255 is out of bounds [02:04:47] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:04:57] PROBLEM - dhclient process on prometheus1004 is CRITICAL: Return code of 255 is out of bounds [02:04:57] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:04:57] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:05:07] PROBLEM - zotero on sca1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:05:07] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:05:07] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:05:07] PROBLEM - puppet last run on prometheus1004 is CRITICAL: Return code of 255 is out of bounds [02:05:27] PROBLEM - salt-minion processes on prometheus1004 is CRITICAL: Return code of 255 is out of bounds [02:06:47] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [02:06:48] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [02:06:57] RECOVERY - zotero on sca1004 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.006 second response time [02:06:57] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [02:06:57] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [02:06:57] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: Return code of 255 is out of bounds [02:07:07] PROBLEM - DPKG on prometheus1004 is CRITICAL: Return code of 255 is out of bounds [02:07:27] PROBLEM - Disk space on prometheus1004 is CRITICAL: Return code of 255 is out of bounds [02:11:20] prometheus1004 has just been installed. 
the others, dunno [02:13:57] (03PS5) 10Tim Landscheidt: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) [02:15:47] PROBLEM - MegaRAID on prometheus1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:17:17] PROBLEM - Host prometheus1004 is DOWN: PING CRITICAL - Packet loss = 100% [02:19:37] RECOVERY - Host prometheus1004 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [02:23:40] (03PS6) 10Tim Landscheidt: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) [02:32:19] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.10) (duration: 13m 23s) [02:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:47] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [02:33:07] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:35:47] RECOVERY - configured eth on prometheus1004 is OK: OK - interfaces up [02:35:57] RECOVERY - dhclient process on prometheus1004 is OK: PROCS OK: 0 processes with command name dhclient [02:35:57] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational [02:36:07] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [02:36:07] RECOVERY - DPKG on prometheus1004 is OK: All packages OK [02:36:27] RECOVERY - salt-minion processes on prometheus1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:36:27] RECOVERY - Disk space on prometheus1004 is OK: DISK OK [02:37:35] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Feb 7 02:37:35 UTC 2017 (duration 5m 16s) [02:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:22] ACKNOWLEDGEMENT - MegaRAID on prometheus1004 is CRITICAL: Return code of 255 is out of bounds daniel_zahn https://phabricator.wikimedia.org/T152504 [02:40:22] ACKNOWLEDGEMENT - NTP on prometheus1004 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn https://phabricator.wikimedia.org/T152504 [02:40:47] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#3004211 (10Dzahn) [02:45:37] RECOVERY - MegaRAID on prometheus1004 is OK: OK: optimal, 2 logical, 6 physical [02:46:05] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3004219 (10Dzahn) Hi @JbuattiWMF I could find "lmixter" and "Adavenport" and, based on your request and their wikimedia.org email address, i have added them to the "wmf" group... 
[02:58:45] (03PS1) 10Dzahn: ganglia: switch eqiad aggregator to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/336361 [03:00:37] (03PS1) 10Dzahn: ganglia: switch codfw aggregator to install2002 [puppet] - 10https://gerrit.wikimedia.org/r/336362 [03:01:07] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [03:02:02] (03PS1) 10Dzahn: netboot: remove install2001 [puppet] - 10https://gerrit.wikimedia.org/r/336363 [03:04:57] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:05:34] (03PS1) 10Dzahn: DHCP: switch install1001->1002, 2001->2002 as TFTP server [puppet] - 10https://gerrit.wikimedia.org/r/336364 [03:12:50] (03PS3) 10Tim Landscheidt: Add extended description to control [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336056 (https://phabricator.wikimedia.org/T156651) [03:12:53] (03PS7) 10Tim Landscheidt: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) [03:15:57] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:19:11] (03CR) 10Tim Landscheidt: [C: 032] Add extended description to control [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336056 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [03:19:47] (03Merged) 10jenkins-bot: Add extended description to control [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336056 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [03:22:41] (03CR) 10Tim Landscheidt: [C: 032] Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [03:23:25] (03Merged) 10jenkins-bot: Generate man page for collector-runner [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [03:24:17] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.07 seconds [03:27:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 212.99 seconds [03:32:57] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [03:44:57] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [03:54:07] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:04:12] (03PS1) 10Tim Landscheidt: Do not manage service with package scripts [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336367 (https://phabricator.wikimedia.org/T156651) [04:16:29] (03CR) 10Tim Landscheidt: [C: 032] Do not manage service with package scripts [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336367 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [04:16:59] (03Merged) 10jenkins-bot: Do not manage service with package scripts [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336367 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [04:18:29] (03PS1) 10Tim Landscheidt: Cut release 0.11 [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336368 [04:18:57] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:19:49] (03CR) 10Tim Landscheidt: [C: 032] Cut release 0.11 [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336368 (owner: 10Tim Landscheidt) [04:20:22] (03Merged) 10jenkins-bot: Cut release 0.11 [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336368 (owner: 10Tim Landscheidt) [04:22:07] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [04:34:17] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:46:57] RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [05:03:17] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [05:04:27] PROBLEM - Disk space on ms-be1012 is CRITICAL: DISK CRITICAL - free space: / 2120 MB (3% inode=90%) [05:18:27] PROBLEM - Disk space on ms-be1012 is CRITICAL: DISK CRITICAL - free space: / 1882 MB (3% inode=90%) [05:37:27] PROBLEM - Disk space on ms-be1012 is CRITICAL: DISK CRITICAL - free space: / 2127 MB (3% inode=90%) [05:41:28] RECOVERY - Disk space on ms-be1012 is OK: DISK OK [05:41:34] !log ms-be1012 running out of space on /, manually compressed /var/log/swift/server.log.1 and cleaned up apt cache T157237 [05:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:38] T157237: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237 [05:43:22] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#3004428 (10Volans) @fgiunchedi swift it's logging ~1GB/hour... it will be full again in ~15h, could you take a look at it today please? [05:52:47] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:10:07] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:21:47] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:25:47] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:39:07] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:44:27] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:49:41] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3004471 (10Marostegui) Thanks @jcrespo! Unfortunately the last thing I heard from @Papaul was that the HP technician didn't show up (he was still waiting for him) so I asked him to turn t... [06:53:06] (03PS1) 10Marostegui: db-eqiad.php: Repool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336376 (https://phabricator.wikimedia.org/T156226) [06:54:47] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:55:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336376 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [06:57:23] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336376 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [06:57:31] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336376 (https://phabricator.wikimedia.org/T156226) (owner: 10Marostegui) [06:58:27] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [06:58:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 - T156226 (duration: 00m 50s) [06:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:54] T156226: Reimage and clone db1072 - https://phabricator.wikimedia.org/T156226 [06:59:17] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL 
- degraded: The system is operational but one or more units failed. [06:59:27] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [07:01:27] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [07:01:28] ^ got it [07:02:17] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [07:06:33] !log Importing commonswiki tables on labsdb1010 - T153743 [07:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:37] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [07:08:19] !log Transferring commonswiki tables from db1064 to labsdb1009 - T153743 [07:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:27] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:18:25] 06Operations, 10ops-eqiad, 13Patch-For-Review: Suspected faulty SSD on graphite1001 - https://phabricator.wikimedia.org/T157022#3004524 (10fgiunchedi) >>! In T157022#3002925, @RobH wrote: > The replacement SSDs have arrived onsite, and planning for replacing them can take place on this task. Thanks @robh an... [07:20:07] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2393 [07:25:07] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 657352 Threads: 1 Questions: 8849418 Slow queries: 3366 Opens: 5286 Flush tables: 1 Open tables: 565 Queries per second avg: 13.462 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [07:31:56] !log added "> /dev/null" manually to the carbon's root crontab (rsync job) to avoid cronspam. 
The change was already merged in https://gerrit.wikimedia.org/r/#/c/336218 but puppet is disabled on carbon. [07:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:32] (03CR) 10Elukey: "Applied the changes manually to the carbon's root crontab since puppet is disabled :)" [puppet] - 10https://gerrit.wikimedia.org/r/336218 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [07:50:25] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.47 [debs/linux44] - 10https://gerrit.wikimedia.org/r/336235 (owner: 10Muehlenhoff) [08:06:11] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3004660 (10elukey) @thcipriani Hi! From what I can see now the error seems different. I followed what is listed in the task's... [08:14:51] 06Operations, 10DBA: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#3004681 (10Marostegui) Hey @Cmjohnson let's move db1073 to B3 on Wednesday if you'd have time? [08:19:54] (03CR) 10Hashar: "\O/" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [08:21:07] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/5350/ only minor differences in the elasticsearch and kafka modules apparently." [puppet] - 10https://gerrit.wikimedia.org/r/336230 (owner: 10Giuseppe Lavagetto) [08:28:16] (03PS3) 10Elukey: Disable auto-restart for nutcracker when config.yaml changes [puppet] - 10https://gerrit.wikimedia.org/r/335780 (https://phabricator.wikimedia.org/T155755) [08:28:44] (03PS1) 10Giuseppe Lavagetto: Add debianization [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/336378 [08:29:29] (03CR) 10Hashar: "Maybe I should amend the commit message / add an inline comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar) [08:30:43] _joe_: buongiorno , do you want jenkins job that builds the .deb for etcd-mirror ? :) [08:31:00] <_joe_> hashar: when it works correctly, yes :) [08:31:06] <_joe_> hashar: I should also write tests [08:31:16] I will make it non-voting as a first step [08:31:47] maybe you can get the tests to run when the package is building ? [08:33:47] landing https://gerrit.wikimedia.org/r/#/c/336379/1/zuul/layout.yaml [08:33:54] (03CR) 10Elukey: "Applied the change manually to an1028, works really fine. After a day of work the space consumption is around 5MBs." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336203 (https://phabricator.wikimedia.org/T156932) (owner: 10Elukey) [08:35:07] (03CR) 10Elukey: [V: 032 C: 032] Enable Yarn's Node Manager recovery to allow graceful restarts [puppet/cdh] - 10https://gerrit.wikimedia.org/r/336203 (https://phabricator.wikimedia.org/T156932) (owner: 10Elukey) [08:35:10] (03CR) 10Hashar: "recheck" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/336378 (owner: 10Giuseppe Lavagetto) [08:35:28] assuming gbp conventions are matched or there is a proper gbp.conf, that should work :} [08:35:36] delta all the lintian nitpicking [08:36:16] gbp:error: upstream/0.0.1 is not a valid treeish :D [08:37:17] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:39:52] (03PS1) 10Elukey: Update the cdh's module sha [puppet] - 10https://gerrit.wikimedia.org/r/336380 (https://phabricator.wikimedia.org/T156932) [08:40:36] !log Transferring commonswiki tables from db1064 to db1095 - T153743 [08:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:41] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [08:44:09] 06Operations: Fix config file handling for /etc/hhvm/php.ini - https://phabricator.wikimedia.org/T157306#3004713 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [08:45:16] 06Operations, 10Wikimedia-Logstash, 07HHVM: Fatal exception of type "Scribunto_LuaInterpreterNotFoundError" - https://phabricator.wikimedia.org/T157110#3004715 (10MoritzMuehlenhoff) 05Open>03Resolved The actual root cause for this error was T157306, I'm closing this task in favour of it. [08:47:50] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/5353/" [puppet] - 10https://gerrit.wikimedia.org/r/336380 (https://phabricator.wikimedia.org/T156932) (owner: 10Elukey) [08:48:14] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3004723 (10hashar) We talked about it during releng meeting yesterday. To capture some of the random ideas I had: a) the jo... [08:51:01] (03PS1) 10Ema: base::service_unit: chmod -x systemd overrides [puppet] - 10https://gerrit.wikimedia.org/r/336381 [08:52:57] PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:57] PROBLEM - MD RAID on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:52:57] PROBLEM - Disk space on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:53:07] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:07] PROBLEM - cassandra-b service on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:07] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:27] PROBLEM - salt-minion processes on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:27] PROBLEM - dhclient process on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:28] PROBLEM - configured eth on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:28] PROBLEM - DPKG on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:28] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:47] PROBLEM - Check size of conntrack table on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:47] PROBLEM - cassandra-a service on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:53:56] mmm [08:54:42] ssh not working for me, let's see the console [08:54:47] RECOVERY - Disk space on restbase-dev1001 is OK: DISK OK [08:54:47] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 20 minutes ago with 0 failures [08:54:47] RECOVERY - cassandra-a service on restbase-dev1001 is OK: OK - cassandra-a is active [08:54:50] i can login via mgmt, but login stalls [08:54:57] RECOVERY - Check whether ferm is active by checking the default input chain on restbase-dev1001 is OK: OK ferm input default policy is set [08:54:57] RECOVERY - cassandra-b service on restbase-dev1001 is OK: OK - cassandra-b is active [08:54:57] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [08:55:03] ah no [Tue Feb 7 08:54:44 2017] md/raid1:md0: Disk failure on sdc1, disabling device. [08:55:03] it's also spewing lots of i/o errors [08:55:14] yes yes it took a bit but it worked :) [08:55:16] disk failed [08:55:17] RECOVERY - salt-minion processes on restbase-dev1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:55:17] RECOVERY - dhclient process on restbase-dev1001 is OK: PROCS OK: 0 processes with command name dhclient [08:55:17] RECOVERY - configured eth on restbase-dev1001 is OK: OK - interfaces up [08:55:17] RECOVERY - DPKG on restbase-dev1001 is OK: All packages OK [08:55:17] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [08:55:37] RECOVERY - Check size of conntrack table on restbase-dev1001 is OK: OK: nf_conntrack is 0 % full [08:55:54] no ops bot that opens the phab task? [08:55:59] * elukey blames volans :D [08:56:34] ops and developers, the worst kind :D [08:57:05] all right opening a phab task! [08:58:16] elukey: or follow up on https://phabricator.wikimedia.org/T151075, these are still in setup [08:59:04] so it is not Riccardo's fault! [08:59:07] :D [08:59:16] thanks for the pointer! 
Will write in there [08:59:17] PROBLEM - cassandra-b CQL 10.64.0.37:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.37 and port 9042: Connection refused [08:59:27] (03PS3) 10Gehel: elasticsearch - codfw servers move to jessie and data on /srv [puppet] - 10https://gerrit.wikimedia.org/r/323157 (https://phabricator.wikimedia.org/T151326) [08:59:31] I thought they were already working [09:01:17] no idea, but at least the task is still not resolved [09:03:09] restarting cassandra-b [09:03:27] PROBLEM - MegaRAID on restbase-dev1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [09:03:30] ACKNOWLEDGEMENT - MegaRAID on restbase-dev1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T157425 [09:03:33] 06Operations, 10ops-eqiad: Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3004739 (10ops-monitoring-bot) [09:03:39] ta daaaaaa [09:04:06] now I have to hug Riccardo so he'll forgive me for the bad things that I said about him [09:04:58] elukey: :-P [09:05:17] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [09:05:56] volans: o/ [09:05:57] PROBLEM - cassandra-b service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [09:06:12] (03CR) 10Hashar: "recheck" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/336057 (https://phabricator.wikimedia.org/T156651) (owner: 10Tim Landscheidt) [09:06:57] RECOVERY - cassandra-b service on restbase-dev1001 is OK: OK - cassandra-b is active [09:07:07] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:07:07] PROBLEM - cassandra-b SSL 10.64.0.37:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:07:17] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:07:17] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:07:18] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:08:34] ERROR 09:06:54 Doesn't have write permissions for /srv/cassandra-b/data directory [09:09:46] ahh maybe it is read only [09:09:57] PROBLEM - cassandra-b service on restbase-dev1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [09:10:15] /dev/md0 / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0 [09:10:41] /dev/mapper/restbase--dev1001--vg-srv /srv ext4 ro,relatime,stripe=512,data=ordered 0 0 [09:11:54] (03CR) 10Zfilipin: [C: 031] Remove Gemfile.lock [puppet] - 10https://gerrit.wikimedia.org/r/336262 (owner: 10Hashar) [09:12:49] 06Operations, 10ops-eqiad: Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3004750 (10elukey) p:05Triage>03Normal a:03Cmjohnson [09:14:06] (03CR) 10Zfilipin: [C: 031] Gemfile: add xmlrpc for ruby 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/332981 (owner: 10Hashar) [09:14:52] (03CR) 10Muehlenhoff: [C: 031] "The change looks fine. 
But I think this will trigger a restart of services using an override and refresh=>true (AFAICT prometheus-node-exp" [puppet] - 10https://gerrit.wikimedia.org/r/336381 (owner: 10Ema) [09:17:18] (03PS1) 10Giuseppe Lavagetto: Add debianization [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/336383 [09:18:06] (03Abandoned) 10Giuseppe Lavagetto: Add debianization [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/336378 (owner: 10Giuseppe Lavagetto) [09:20:34] 06Operations, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3004754 (10hashar) Some mediawiki-config history: March 17 2016, timeout bump from 250ms to 300ms 4c7a6ec3de00256878df9f1299... [09:25:38] (03CR) 10Giuseppe Lavagetto: [C: 032] Add debianization [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/336383 (owner: 10Giuseppe Lavagetto) [09:31:45] (03PS1) 10Faidon Liambotis: aptrepo: add suite stretch-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/336386 [09:34:17] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [09:37:17] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:37:37] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:43:57] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy [09:44:07] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [09:44:07] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy [09:44:29] mmmm [09:44:37] cassandra magic? 
[09:45:08] no I just scheduled downtime for 1001, what I wanted to do was shutting down and mask the two instances on 1001 [09:46:32] !log stopped and masked cassandra-{a,b} - T157425 [09:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:37] T157425: Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425 [09:47:05] we have other two racks set up in cassandra, and replication 3 [09:47:09] so it shouldn't be a problem [09:47:18] plus this is dev :) [09:48:57] !log installing cairo security updates [09:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:21] all right I need to step afk for an hour, will check again later on if anything has changed.. thanks volans! [09:49:37] I've done close to nothing ;) [10:02:48] <_joe_> !log uploaded etcd-mirror 0.0.1 to jessie-wikimedia (T156009) [10:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:53] T156009: Create an etcd cluster in codfw - https://phabricator.wikimedia.org/T156009 [10:06:37] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:34:38] !log preparing db2046 for reimage T152188 [10:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:43] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [10:50:58] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Jenkins: Upload Jenkins LTS v2.7.4 to wikimedia-ex - https://phabricator.wikimedia.org/T157429#3004915 (10hashar) Need #operations to publish the package on apt.wikimedia.org and review the idea of using the `experimenta... 
[10:51:19] (03PS1) 10Jcrespo: mariadb: Depool db1036 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336391 (https://phabricator.wikimedia.org/T152188) [10:51:51] 06Operations: Integrate jessie 8.4 point release - https://phabricator.wikimedia.org/T131746#3004921 (10MoritzMuehlenhoff) 05Open>03Resolved This is resolved. [10:53:02] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Jenkins: Upload Jenkins LTS v2.7.4 to jessie-wikimedia/experimental - https://phabricator.wikimedia.org/T157429#3004923 (10hashar) [10:57:12] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Jenkins: Upload Jenkins LTS v2.7.4 to jessie-wikimedia/experimental - https://phabricator.wikimedia.org/T157429#3004897 (10MoritzMuehlenhoff) @hashar: jessie-wikimedia/experimental seems fine, we also used that to stage... [10:57:32] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Jenkins: Upload Jenkins LTS v2.7.4 to jessie-wikimedia/experimental - https://phabricator.wikimedia.org/T157429#3004926 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [10:58:58] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1036 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336391 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:12:37] (03CR) 10Jcrespo: "Are you sure phuser is the right user to add the grants to, and not some of the others?" 
[puppet] - 10https://gerrit.wikimedia.org/r/335554 (owner: 10Jcrespo) [11:13:22] 06Operations, 10Traffic: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3004933 (10ema) [11:13:28] 06Operations, 10Traffic: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3004949 (10ema) p:05Triage>03High [11:13:44] !log installing libpng security updates [11:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:42] (03PS1) 10ArielGlenn: write hash sums, dumpruninfo, status report additionally in json [dumps] - 10https://gerrit.wikimedia.org/r/336395 [11:26:59] (03CR) 10jerkins-bot: [V: 04-1] write hash sums, dumpruninfo, status report additionally in json [dumps] - 10https://gerrit.wikimedia.org/r/336395 (owner: 10ArielGlenn) [11:27:33] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1036 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336391 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:27:50] !log installing libgd security updates [11:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:58] (03Merged) 10jenkins-bot: mariadb: Depool db1036 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336391 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:29:07] (03CR) 10jenkins-bot: mariadb: Depool db1036 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336391 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:30:42] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1036 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336396 [11:31:55] (03PS2) 10ArielGlenn: write hash sums, dumpruninfo, status report additionally in json [dumps] - 10https://gerrit.wikimedia.org/r/336395 [11:32:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1036 (duration: 00m 40s) [11:32:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:27] PROBLEM - Nginx local proxy to apache on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:34:47] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:35:07] PROBLEM - Apache HTTP on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:35:51] hhvm locked again? [11:36:17] load 128, queued 84 :D [11:37:56] !log restarting hhvm on mw1226 (hhvm dump debug in /tmp/hhvm.33183.bt.) [11:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:19] hope that it will go away with 3.18 [11:38:20] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1036 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336396 (owner: 10Jcrespo) [11:39:18] RECOVERY - Nginx local proxy to apache on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.031 second response time [11:39:37] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 72784 bytes in 0.112 second response time [11:39:55] !log restarting and upgrading db1036 [11:39:57] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.037 second response time [11:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:24] 06Operations, 10Traffic: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3004993 (10ema) I've confirmed on cp3040 that the issue is not reproducible by doing either of the following: * set .initial to the same value as .threshold in the [[ https://github.com/wikimedia/ope... 
[11:40:38] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (blocked): setup/install restbase-dev100[123] - https://phabricator.wikimedia.org/T151075#3004994 (10elukey) FYI https://phabricator.wikimedia.org/T157425 [11:41:52] (03PS1) 10MarcoAurelio: Short aliases for Module/Module_talk for Malayalam Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336397 [11:42:05] !log installing libxpm security updates [11:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:06] (03PS2) 10MarcoAurelio: Short aliases for Module/Module_talk for Malayalam Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336397 (https://phabricator.wikimedia.org/T56951) [11:49:18] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3005005 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db2046.codfw.wmnet'] ``` The log can be found in `/var/log/... [11:53:12] !log stop puppet on ms-be1012 and change rsyslog to avoid local syslog spam - T157237 [11:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:18] T157237: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237 [11:54:47] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:56:52] !log restarting hhvm on appserver canaries to pick up lcms, sqlite, libxpm, gnutls and glibc updates (from jessie 8.7 release and security updates) [11:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:36] (03PS1) 10MarcoAurelio: Create autopatrolled and rollbacker permissions for fa.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336401 (https://phabricator.wikimedia.org/T156163) [12:01:05] (03CR) 10MarcoAurelio: [C: 04-1] "Lacks addGroups / removeGroups" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336401 (https://phabricator.wikimedia.org/T156163) (owner: 10MarcoAurelio) [12:02:09] (03CR) 10jerkins-bot: [V: 04-1] Create autopatrolled and rollbacker permissions for fa.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336401 (https://phabricator.wikimedia.org/T156163) (owner: 10MarcoAurelio) [12:04:27] 06Operations, 10Traffic: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3005029 (10ema) [12:06:07] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:07:27] (03CR) 10Marostegui: [C: 04-2] "Wait for this to be clarified: https://phabricator.wikimedia.org/T149418#3004834" [puppet] - 10https://gerrit.wikimedia.org/r/335816 (https://phabricator.wikimedia.org/T149418) (owner: 10Marostegui) [12:08:30] (03PS2) 10MarcoAurelio: Create autopatrolled and rollbacker permissions for fa.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336401 (https://phabricator.wikimedia.org/T156163) [12:10:40] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1036 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336396 (owner: 10Jcrespo) [12:12:27] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1036 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336396 (owner: 10Jcrespo) [12:12:36] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1036 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336396 (owner: 10Jcrespo) [12:16:23] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1036 (duration: 00m 40s) [12:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:27] PROBLEM - puppet last run on dbproxy1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:17:15] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3005059 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2046.codfw.wmnet'] ``` and were **ALL** successful. 
[12:21:47] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:31:05] 06Operations, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005077 (10elukey) [12:31:22] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005090 (10elukey) p:05Triage>03Normal [12:33:06] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [12:34:32] 06Operations, 13Patch-For-Review: Re-add intel-microcode - https://phabricator.wikimedia.org/T127825#3005106 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:35:14] (03PS1) 10Jcrespo: mariadb: Depool db1021 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336402 (https://phabricator.wikimedia.org/T152188) [12:37:30] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1021 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336402 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [12:37:53] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1021 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336402 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [12:39:07] (03Merged) 10jenkins-bot: mariadb: Depool db1021 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336402 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [12:39:16] (03CR) 10jenkins-bot: mariadb: Depool db1021 for a quick reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336402 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [12:41:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1021 (duration: 00m 41s) [12:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:54] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1021 for a quick reboot" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/336403 [12:44:58] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1021 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336403 (owner: 10Jcrespo) [12:45:26] RECOVERY - puppet last run on dbproxy1011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:46:13] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005127 (10elukey) Adding install1002's IP to the whitelist should be: ``` edit set firewall family inet filter analytics-in4 term analytics-publicIP-v4 from destination-address 208.80.... [12:53:10] (03PS12) 10Rush: nodepool: track and alert on age of instance states script [puppet] - 10https://gerrit.wikimedia.org/r/335373 [12:53:18] (03PS13) 10Rush: nodepool: track and alert on age of instance states script [puppet] - 10https://gerrit.wikimedia.org/r/335373 [12:53:56] (03CR) 10Gehel: [C: 032] elasticsearch - codfw servers move to jessie and data on /srv [puppet] - 10https://gerrit.wikimedia.org/r/323157 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [12:55:38] (03PS14) 10Rush: nodepool: track and alert on age of instance states script [puppet] - 10https://gerrit.wikimedia.org/r/335373 [12:59:20] (03PS15) 10Rush: nodepool: track and alert on age of instance states script [puppet] - 10https://gerrit.wikimedia.org/r/335373 [13:05:27] (03CR) 10Rush: [C: 032] nodepool: track and alert on age of instance states script [puppet] - 10https://gerrit.wikimedia.org/r/335373 (owner: 10Rush) [13:06:56] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:09:06] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#3005242 (10TheDJ) [13:11:05] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#2799695 (10TheDJ) Has anyone considered creating a separate (restbase) endpoint for pdf renderings, that the ser... [13:12:31] (03PS4) 10Hashar: contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [13:16:01] (03CR) 10Hashar: [C: 031] "Rebased. Puppet compile is : https://puppet-compiler.wmflabs.org/5354/" [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [13:20:47] (03PS1) 10Rush: nodepool: active check for node pool instance states [puppet] - 10https://gerrit.wikimedia.org/r/336404 [13:21:50] (03CR) 10jerkins-bot: [V: 04-1] nodepool: active check for node pool instance states [puppet] - 10https://gerrit.wikimedia.org/r/336404 (owner: 10Rush) [13:23:18] (03PS2) 10Rush: nodepool: active check for node pool instance states [puppet] - 10https://gerrit.wikimedia.org/r/336404 [13:25:29] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005305 (10elukey) Stale things found while reviewing: * term udplog is probably not worth to keep * term kafka is missing kafka2003's IP * term archiva should contain meitnerium's IP,... 
[13:27:49] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 07Jenkins: Upload Jenkins LTS v2.7.4 to jessie-wikimedia/experimental - https://phabricator.wikimedia.org/T157429#3005314 (10MoritzMuehlenhoff) jenkins 2.7.4 has been uploaded to carbon in the jessie-wikimedia/experimental... [13:29:47] (03PS4) 10Tim Landscheidt: Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) [13:30:16] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:30:16] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:30:26] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [13:30:59] (03PS2) 10ArielGlenn: dataset1001 rsync to labs of dumps can now use explicit inclusion list [puppet] - 10https://gerrit.wikimedia.org/r/336204 (https://phabricator.wikimedia.org/T154798) [13:31:47] (03CR) 10Tim Landscheidt: [C: 032] Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [13:34:56] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:35:38] (03CR) 10Tim Landscheidt: [V: 032 C: 032] Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [13:39:16] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [13:39:26] RECOVERY - Ensure mysql credential creation for tools users is 
running on labstore1005 is OK: OK - maintain-dbusers is active [13:41:13] (03Merged) 10jenkins-bot: Wait for the Kubernetes pod to shut down after "stop" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/336049 (https://phabricator.wikimedia.org/T156626) (owner: 10Tim Landscheidt) [13:41:20] ^ flapped but I'm not sure why atm [13:44:59] !log Importing commonswiki tables on db1095 - T153743 [13:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:05] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [13:48:25] jouncebot, next [13:48:25] In 0 hour(s) and 11 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170207T1400) [13:53:14] (03PS3) 10Hoo man: Search index article placeholders on cywiki up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) [13:54:08] !log restarting and upgrading db1021 [13:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:16] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:00:01] (03PS1) 10Hashar: Support Jenkins install from 'experimental' component [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170207T1400). [14:00:04] hoo, frimelle, and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:15] Present and ready for SWAT! 
[14:00:20] o/ [14:00:31] I can do my change myself, unless someone else wants to do the deploy [14:01:04] o/ [14:01:15] hoo: please do :-) [14:01:28] hashar: want to do the rest of the patches? [14:01:29] * aude waves [14:01:55] (03CR) 10Hoo man: [C: 032] Search index article placeholders on cywiki up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) (owner: 10Hoo man) [14:02:35] aude, hashar: ping me if neither of you wants to do swat, I can do it, but I'm in the middle of something [14:02:46] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005522 (10elukey) Most urgent fixes: * Remove old AQS IPs ``` delete firewall family inet filter analytics-in4 term aqs from destination-address 10.64.0.123/32 delete firewall family i... [14:02:57] hoo: hold on :D [14:03:01] Oh, why? [14:03:14] Removed the +2 [14:03:20] is $wmgArticlePlaceholderSearchEngineIndexed defined? [14:03:34] wmf-config/Wikibase-production.php: $wgArticlePlaceholderSearchEngineIndexed = $wmgArticlePlaceholderSearchEngineIndexed; [14:03:38] is the sole occurrence I believe [14:04:01] Oh, I set the wg one directly in InitialiseSettings :/ [14:04:06] I'll amend [14:04:06] ;D [14:04:15] that's why I want people to review these :( [14:04:24] I too!
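The review catch above (a `$wmg…` variable read in Wikibase-production.php without a matching definition in InitialiseSettings) could be checked mechanically. The following is only a rough sketch: the file excerpts are hypothetical stand-ins rather than the real wmf-config contents, the `'wmgName' =>` key syntax is an assumption about how settings are declared, and the helper names are invented for illustration.

```python
import re

# Hypothetical excerpts; the real wmf-config files are not reproduced in this log.
WIKIBASE_PRODUCTION = """
$wgArticlePlaceholderSearchEngineIndexed = $wmgArticlePlaceholderSearchEngineIndexed;
"""

# Assumed state before the amend: only the wg name is set directly.
INITIALISE_SETTINGS_BEFORE = """
'wgArticlePlaceholderSearchEngineIndexed' => [
    'default' => false,
],
"""

# Assumed state after the amend: the wmg name is defined.
INITIALISE_SETTINGS_AFTER = """
'wmgArticlePlaceholderSearchEngineIndexed' => [
    'default' => false,
],
"""


def referenced_wmg(source):
    """Return the set of $wmg... variable names read in a PHP snippet."""
    return set(re.findall(r"\$(wmg[A-Za-z0-9_]+)", source))


def defined_settings(source):
    """Return the set of wmg... names that appear as quoted array keys."""
    return set(re.findall(r"'(wmg[A-Za-z0-9_]+)'\s*=>", source))


def undefined_wmg(consumer, settings):
    """List wmg names the consumer reads but the settings text never defines."""
    return sorted(referenced_wmg(consumer) - defined_settings(settings))


if __name__ == "__main__":
    print(undefined_wmg(WIKIBASE_PRODUCTION, INITIALISE_SETTINGS_BEFORE))
    print(undefined_wmg(WIKIBASE_PRODUCTION, INITIALISE_SETTINGS_AFTER))
```

With the "before" excerpt the check reports the missing name; with the "after" excerpt it reports nothing. A text scan like this is no substitute for review, since it knows nothing about PHP scoping or load order.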
[14:04:58] and I think we gotta deploy the change in two passes [14:04:59] good catch [14:05:11] Yeah [14:05:32] zeljkof: yeah I can handle / support the swat today [14:05:53] !log Importing commonswiki tables on labsdb1009 - T153743 [14:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:57] (03PS4) 10Hoo man: Search index article placeholders on cywiki up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) [14:05:58] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [14:06:56] (03CR) 10Hoo man: [C: 032] Search index article placeholders on cywiki up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) (owner: 10Hoo man) [14:07:35] (03CR) 10Hashar: "Puppet compiler output is not so helpful since all the resources are new. contint1001 having them with ensure=>absent." [puppet] - 10https://gerrit.wikimedia.org/r/336408 (https://phabricator.wikimedia.org/T157429) (owner: 10Hashar) [14:08:16] (03CR) 10Hashar: [C: 031] Search index article placeholders on cywiki up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) (owner: 10Hoo man) [14:08:36] (03Merged) 10jenkins-bot: Search index article placeholders on cywiki up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) (owner: 10Hoo man) [14:08:44] (03CR) 10jenkins-bot: Search index article placeholders on cywiki up to Q2794 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336225 (https://phabricator.wikimedia.org/T144592) (owner: 10Hoo man) [14:09:27] hashar: thanks! 
[14:10:21] (03Abandoned) 10Hashar: Remove unneed permissions from enwiki bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299354 (https://phabricator.wikimedia.org/T140550) (owner: 10Kharkiv07) [14:10:43] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Introduce $wmgArticlePlaceholderSearchEngineIndexed (duration: 00m 52s) [14:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:05] (03PS1) 10Gehel: elasticsaerch - change the default data directory to /srv/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/336409 (https://phabricator.wikimedia.org/T151328) [14:11:07] (03PS1) 10Gehel: elasticsearch - codfw cirrus cluster - move data to /srv/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/336410 (https://phabricator.wikimedia.org/T151328) [14:12:14] !log hoo@tin Synchronized wmf-config/: Search index article placeholders on cywiki up to Q2794 (T144592) (duration: 00m 42s) [14:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:18] T144592: Search index a limited number of article placeholders on cywiki for testing and evaluation purposes - https://phabricator.wikimedia.org/T144592 [14:13:17] (03PS2) 10Hashar: Namespace changes for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336004 (https://phabricator.wikimedia.org/T157187) (owner: 10Urbanecm) [14:13:28] Verified, works as intended [14:13:36] thanks! 
[14:13:46] and especially, thank you hashar for catching that [14:16:46] (03CR) 10Hashar: Namespace changes for elwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336004 (https://phabricator.wikimedia.org/T157187) (owner: 10Urbanecm) [14:17:39] (03PS3) 10Hashar: Namespace changes for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336004 (https://phabricator.wikimedia.org/T157187) (owner: 10Urbanecm) [14:17:51] hoo: I have been lucky :} [14:19:31] (03CR) 10Hashar: [C: 032] Namespace changes for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336004 (https://phabricator.wikimedia.org/T157187) (owner: 10Urbanecm) [14:19:50] !log restarting all the Yarn Node Managers on the Hadoop worker nodes to pick up the new config - T156932 [14:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:54] T156932: Investigate if Node Managers can be restarted without impacting running containers - https://phabricator.wikimedia.org/T156932 [14:21:10] (03Merged) 10jenkins-bot: Namespace changes for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336004 (https://phabricator.wikimedia.org/T157187) (owner: 10Urbanecm) [14:21:18] (03CR) 10jenkins-bot: Namespace changes for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336004 (https://phabricator.wikimedia.org/T157187) (owner: 10Urbanecm) [14:22:25] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Namespace changes for elwikisource - T157187 (duration: 00m 40s) [14:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:31] T157187: New namespace and aliases for el.wikisource - https://phabricator.wikimedia.org/T157187 [14:23:12] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005589 (10Ottomata) +1 to all of these. 
But, seeing as there has been an IPv6 with the ACLs for a while, maybe we should ask Ops about the use of continuing to support this VLAN. Not... [14:23:36] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:46] checking.. [14:25:56] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=317.00 Read Requests/Sec=352.00 Write Requests/Sec=0.50 KBytes Read/Sec=45056.00 KBytes_Written/Sec=4.40 [14:26:56] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=135.60 Read Requests/Sec=97.90 Write Requests/Sec=1.30 KBytes Read/Sec=1216.80 KBytes_Written/Sec=91.20 [14:27:33] !log restarting hhvm on mw1304 (load very high, no queue, threads locked - /tmp/hhvm.62070.bt.) [14:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:45] !log European swat copleted [14:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:25] I actually prefer my mornings with less coplet :-) [14:29:26] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [14:29:45] ;D [14:31:47] (03CR) 10Gehel: "This is a noop: https://puppet-compiler.wmflabs.org/5357/" [puppet] - 10https://gerrit.wikimedia.org/r/336409 (https://phabricator.wikimedia.org/T151328) (owner: 10Gehel) [14:34:18] 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3005626 (10mobrovac) [14:34:26] (03CR) 10Gehel: "This is a noop: https://puppet-compiler.wmflabs.org/5358/" [puppet] - 10https://gerrit.wikimedia.org/r/336410 (https://phabricator.wikimedia.org/T151328) (owner: 10Gehel) [14:37:14] (03CR) 10Hashar: "Added it to Puppet SWAT." [puppet] - 10https://gerrit.wikimedia.org/r/336262 (owner: 10Hashar) [14:37:17] (03CR) 10Hashar: [C: 031] "Added it to Puppet SWAT." 
[puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [14:37:22] (03CR) 10Hashar: [C: 031] "Added it to Puppet SWAT." [puppet] - 10https://gerrit.wikimedia.org/r/290896 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [14:40:30] (03CR) 10Gehel: [C: 032] elasticsaerch - change the default data directory to /srv/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/336409 (https://phabricator.wikimedia.org/T151328) (owner: 10Gehel) [14:40:37] (03CR) 10Gehel: [C: 032] elasticsearch - codfw cirrus cluster - move data to /srv/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/336410 (https://phabricator.wikimedia.org/T151328) (owner: 10Gehel) [14:41:22] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3005658 (10elukey) Old/New elastic search IP from Discovery: https://etherpad.wikimedia.org/p/analytics-acls [14:42:24] !log drain shards from elastic2001 / elastic2002 in preperation for reimage - T151326 [14:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:30] T151326: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326 [14:54:31] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#2799695 (10mobrovac) @TheDJ I believe these are captured in the subtasks. Once we have the complete functionalit... 
[14:56:24] !log preparing db2036 for reimage T152188 [14:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:29] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [14:57:05] (03PS1) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [14:58:20] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 (owner: 10Rush) [14:58:39] (03PS1) 10Gehel: elasticsearch - reimage elastic200[12] to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336414 (https://phabricator.wikimedia.org/T151326) [15:02:37] (03PS2) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [15:03:15] (03PS3) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [15:04:17] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 (owner: 10Rush) [15:08:40] (03PS4) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [15:08:56] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:36] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:10:43] (03PS5) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [15:13:23] (03CR) 10Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/5359/" [puppet] - 10https://gerrit.wikimedia.org/r/336414 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [15:14:37] (03PS1) 10Muehlenhoff: More email addresses of WMF staff/contractors with LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/336417 [15:18:55] (03PS1) 10Elukey: Replace Memcached/Redis codfw shard12->16 [puppet] - 10https://gerrit.wikimedia.org/r/336419 (https://phabricator.wikimedia.org/T155755) [15:21:31] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3005800 (10Gehel) steps to reimage an elasticsearch node: 1. drain shards from that node: `es-tool ban-node ` 1. wait for all shards to... [15:29:11] (03PS4) 10Tjones: Deploy TextCat Improvements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) [15:29:52] (03PS1) 10Muehlenhoff: Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 [15:33:06] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/5360 looks good" [puppet] - 10https://gerrit.wikimedia.org/r/336419 (https://phabricator.wikimedia.org/T155755) (owner: 10Elukey) [15:33:52] (03CR) 10Andrew Bogott: "I like this!" [puppet] - 10https://gerrit.wikimedia.org/r/336413 (owner: 10Rush) [15:35:17] (03CR) 10Andrew Bogott: [C: 032] toollabs: Install mktorrent [puppet] - 10https://gerrit.wikimedia.org/r/334962 (https://phabricator.wikimedia.org/T155470) (owner: 10Legoktm) [15:36:31] (03CR) 10Andrew Bogott: [C: 032] "I'll merge this. 
Keep in mind, though, that you might be better served by writing your new tool for the k8s backend instead (for which th" [puppet] - 10https://gerrit.wikimedia.org/r/334962 (https://phabricator.wikimedia.org/T155470) (owner: 10Legoktm) [15:36:56] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:38:28] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3005848 (10jcrespo) [15:38:36] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:39:15] (03PS2) 10Andrew Bogott: toollabs: Install mktorrent [puppet] - 10https://gerrit.wikimedia.org/r/334962 (https://phabricator.wikimedia.org/T155470) (owner: 10Legoktm) [15:40:25] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 07User-notice: labsdb1005 (mysql) maintenance for reimage - https://phabricator.wikimedia.org/T157358#3002516 (10jcrespo) Adding user notice. In theory, no end users should be affected, but if some tools have not been properly programmed to reconnect, t... 
[15:42:56] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2746.40 Read Requests/Sec=521.20 Write Requests/Sec=3.00 KBytes Read/Sec=44702.40 KBytes_Written/Sec=97.20 [15:44:41] 06Operations, 10ops-eqiad: mw1198.eqiad.wmnet kernel reports temperature issues - https://phabricator.wikimedia.org/T157459#3005877 (10hashar) [15:46:15] 06Operations, 10ops-eqiad: mw1198.eqiad.wmnet kernel reports temperature issues - https://phabricator.wikimedia.org/T157459#3005877 (10MoritzMuehlenhoff) We already have https://phabricator.wikimedia.org/T149287 for these [15:46:25] 06Operations, 10ops-eqiad: mw1198.eqiad.wmnet kernel reports temperature issues - https://phabricator.wikimedia.org/T157459#3005891 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff [15:47:09] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1021 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336403 (owner: 10Jcrespo) [15:48:27] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1021 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336403 (owner: 10Jcrespo) [15:48:32] (03PS1) 10Reedy: Run GenerateFancyCaptchas.php against enwiki rather than aawiki [puppet] - 10https://gerrit.wikimedia.org/r/336423 [15:48:36] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1021 for a quick reboot" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336403 (owner: 10Jcrespo) [15:48:49] (03CR) 10Reedy: "reedy@tin:/srv/mediawiki-staging/php-1.29.0-wmf.10/extensions/ConfirmEdit/maintenance$ /usr/local/bin/mwscript extensions/ConfirmEdit/main" [puppet] - 10https://gerrit.wikimedia.org/r/336423 (owner: 10Reedy) [15:52:01] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1012 - https://phabricator.wikimedia.org/T157237#2999842 (10Cmjohnson) I am assuming this is one of the ssds when I pull the pd list with megacli a ssd is missing. Please confirm. The system is out of warranty but we have spares on-site. 
[15:52:37] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove Gemfile.lock [puppet] - 10https://gerrit.wikimedia.org/r/336262 (owner: 10Hashar) [15:52:56] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=13.00 Read Requests/Sec=1.70 Write Requests/Sec=1.40 KBytes Read/Sec=15.60 KBytes_Written/Sec=20.00 [15:53:56] (03CR) 10Giuseppe Lavagetto: [C: 031] Escape period in wiki.phtml rewrites [puppet] - 10https://gerrit.wikimedia.org/r/331944 (owner: 10Reedy) [15:57:36] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Not sure what is this needed for, but there are a couple of small things to fix too." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [15:59:17] (03CR) 10Giuseppe Lavagetto: [C: 031] contint: disable DNS lookup for castor rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/290896 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [16:00:01] (03CR) 10Giuseppe Lavagetto: [C: 031] Run GenerateFancyCaptchas.php against enwiki rather than aawiki [puppet] - 10https://gerrit.wikimedia.org/r/336423 (owner: 10Reedy) [16:02:03] (03CR) 10Chad: "Identical to what I did here, for example: https://gerrit.wikimedia.org/r/#/c/332665/2/w/api.php" [puppet] - 10https://gerrit.wikimedia.org/r/332648 (owner: 10Chad) [16:04:40] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3005962 (10Papaul) Unfortunately the HP tech didn't show up. I m following up with HP on the case. 
[16:06:22] 06Operations, 06Analytics-Kanban, 06Performance-Team, 06Reading-Admin, 10Traffic: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#3005963 (10Nuria) a:03Nuria [16:08:17] !log restarting and upgrading db2064 T152188 [16:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:22] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [16:09:58] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3005970 (10Marostegui) Thanks @Papaul - I will leave the server depooled so we can shut it down anytime once you've arranged another day and time [16:20:55] (03CR) 10Andrew Bogott: [C: 031] "Looks good, will run through the compiler before merging." [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [16:21:03] (03PS6) 10Andrew Bogott: openstack: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [16:22:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1021 (duration: 00m 57s) [16:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:05] (03CR) 10Andrew Bogott: [C: 04-1] "At least one issue, inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/334301 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [16:30:21] (03PS1) 10Jcrespo: mariadb: Depool db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336429 (https://phabricator.wikimedia.org/T152188) [16:30:50] (03PS6) 10Andrew Bogott: labs modules linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334290 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [16:31:44] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3006017 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2001.codfw.wmnet'] ```... [16:33:06] (03PS1) 10Elukey: Enable aqs1009-b (AQS Cassandra cluster) [puppet] - 10https://gerrit.wikimedia.org/r/336430 (https://phabricator.wikimedia.org/T155654) [16:33:31] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336429 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [16:34:06] (03CR) 10Andrew Bogott: [C: 032] "The on-labs cases are hard to test with the puppet compiler, but at least the labstore cases are no-ops." 
[puppet] - 10https://gerrit.wikimedia.org/r/334290 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [16:35:20] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336429 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [16:36:46] (03CR) 10jenkins-bot: mariadb: Depool db1022 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336429 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [16:40:07] (03CR) 10Andrew Bogott: "Testing with the epic compiler job https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/5363/console" [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [16:40:54] (03PS3) 10Andrew Bogott: ssl: delete ldap-eqiad/ldap-codfw.wikimedia.org certs [puppet] - 10https://gerrit.wikimedia.org/r/334211 (owner: 10Dzahn) [16:42:39] (03CR) 10Andrew Bogott: [C: 032] Linting fixes (Multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [16:42:58] (03CR) 10Andrew Bogott: "oops, clicked in the wrong window! 
This is still in testing :)" [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [16:43:15] (03CR) 10Andrew Bogott: [C: 032] ssl: delete ldap-eqiad/ldap-codfw.wikimedia.org certs [puppet] - 10https://gerrit.wikimedia.org/r/334211 (owner: 10Dzahn) [16:43:53] (03CR) 10Hashar: [C: 031] "First use case is in the child change https://gerrit.wikimedia.org/r/#/c/290896/3/modules/role/manifests/ci/castor/server.pp" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [16:45:20] (03PS4) 10Hashar: rsync: allow extra settings in rsyncd.conf [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) [16:47:28] (03CR) 10Hashar: "Cherry picked it on CI puppet master but:" [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [16:49:36] (03PS3) 10Andrew Bogott: Tools: Remove redundant tools-db entry from /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/328453 (https://phabricator.wikimedia.org/T139190) (owner: 10Tim Landscheidt) [16:54:54] (03CR) 10Andrew Bogott: [C: 032] Tools: Remove redundant tools-db entry from /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/328453 (https://phabricator.wikimedia.org/T139190) (owner: 10Tim Landscheidt) [16:56:12] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3006101 (10Papaul) The service was canceled, according to HP they couldn't get in touch with me; which is not true because i didn't received any calls or emails from them. Another service... [17:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170207T1700). Please do the needful. 
[17:00:04] hashar, reedy, and RainbowSprinkles: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:01:54] <_joe_> ok I will do it [17:02:24] (03PS2) 10Giuseppe Lavagetto: Remove Gemfile.lock [puppet] - 10https://gerrit.wikimedia.org/r/336262 (owner: 10Hashar) [17:02:33] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Remove Gemfile.lock [puppet] - 10https://gerrit.wikimedia.org/r/336262 (owner: 10Hashar) [17:02:51] (03PS5) 10Hashar: rsync: allow extra settings in rsyncd.conf [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) [17:03:59] (03CR) 10Hashar: "I went back with pachset 3 though I have removed the sort() calls. As for the newlines, I could not figure out locally how to get rid of" [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [17:04:15] _joe_: the erb stuff terribly confuses me :/ [17:04:29] I tried several way to try to not alter the output when a variable is not set, but eventually I gotta give up for now [17:04:48] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic200[12] to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336414 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [17:04:56] (03PS2) 10Gehel: elasticsearch - reimage elastic200[12] to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336414 (https://phabricator.wikimedia.org/T151326) [17:05:03] (03CR) 10Gehel: [V: 032 C: 032] elasticsearch - reimage elastic200[12] to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336414 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [17:05:04] <_joe_> hashar: as I told you, put a <%- if @rsyncd_conf != {} -%> ... 
<%- end -%> around your code [17:05:15] ohh extra dashes everywhere [17:05:34] <_joe_> hashar: I'll do the other patches while you fix that [17:05:35] will revisit / test that one later on so [17:05:52] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1022 for maintenance (duration: 00m 40s) [17:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:57] (03CR) 10Giuseppe Lavagetto: [C: 032] Run GenerateFancyCaptchas.php against enwiki rather than aawiki [puppet] - 10https://gerrit.wikimedia.org/r/336423 (owner: 10Reedy) [17:06:02] (03CR) 10Hashar: [C: 04-1] "Pending. Per joe on IRC:" [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [17:06:04] (03PS2) 10Giuseppe Lavagetto: Run GenerateFancyCaptchas.php against enwiki rather than aawiki [puppet] - 10https://gerrit.wikimedia.org/r/336423 (owner: 10Reedy) [17:06:12] <_joe_> Reedy: ^^ [17:06:18] _joe_: just skip that patch, thx for the review :) [17:06:25] _joe_: cheers! 
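[editor's note] The ERB advice above (wrap the block in `<%- if @rsyncd_conf != {} -%>` ... `<%- end -%>`, "extra dashes everywhere") can be sketched roughly as follows. This is only an illustration, not the template from the patch: the local `conf` stands in for `@rsyncd_conf`, the surrounding option lines and the `reverse lookup` value are made up, and `trim_mode: '-'` is assumed to behave like the trimming Puppet applies to its templates.

```ruby
require 'erb'

# Hypothetical template fragment. `conf` plays the role of @rsyncd_conf;
# the two fixed option lines are invented for the example.
GUARDED = <<~TPL
  use chroot = yes
  <%- if conf != {} -%>
  <%- conf.each do |key, value| -%>
  <%= key %> = <%= value %>
  <%- end -%>
  <%- end -%>
  max connections = 4
TPL

# Same template with the dashes removed, to show what they change.
UNGUARDED = GUARDED.gsub('<%-', '<%').gsub('-%>', '%>')

# Render a template with '-' trim mode, so that a closing "-%>" swallows
# the newline that follows the tag.
def render(template, conf)
  ERB.new(template, trim_mode: '-').result_with_hash(conf: conf)
end

puts render(GUARDED, {})                           # no stray blank line
puts render(GUARDED, { 'reverse lookup' => 'no' }) # extra setting rendered
puts render(UNGUARDED, {})                         # blank line leaks through
```

With an empty hash the dashed version leaves the output untouched, while the undashed one lets a blank line through, which appears to be the "alter the output when a variable is not set" problem described above.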
[17:06:35] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Run GenerateFancyCaptchas.php against enwiki rather than aawiki [puppet] - 10https://gerrit.wikimedia.org/r/336423 (owner: 10Reedy) [17:07:47] !log restarting and upgrading db1022 T152188 [17:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:50] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [17:08:16] <_joe_> Reedy: {{done}} [17:08:25] <_joe_> also ran on terbium [17:09:14] (03PS4) 10Giuseppe Lavagetto: Escape period in wiki.phtml rewrites [puppet] - 10https://gerrit.wikimedia.org/r/331944 (owner: 10Reedy) [17:10:53] (03CR) 10Giuseppe Lavagetto: [C: 032] Escape period in wiki.phtml rewrites [puppet] - 10https://gerrit.wikimedia.org/r/331944 (owner: 10Reedy) [17:13:49] <_joe_> Reedy: testing on mwdebug1001 in ~1 [17:14:32] (03PS1) 10Elukey: Fix retry_interval for check_leaked_hhvm_threads [puppet] - 10https://gerrit.wikimedia.org/r/336438 [17:15:26] https://en.wikipedia.org/w/wiki@phtml?title=foo&action=info gives 404 on mwdebug1001 [17:15:38] Which is expected behaviour with the patch :) [17:15:42] <_joe_> yes [17:16:00] as long as https://en.wikipedia.org/w/wiki.phtml?title=foo&action=info doesn't... [17:16:26] yup, that works [17:16:30] as horrible as the urls is [17:16:31] <_joe_> jynus: that's what I verified :P [17:17:09] riccardo may have mentioned that on the long term todo there was production http regression testing? [17:17:30] <_joe_> Reedy: I'm going to run puppet forcedly around the cluster [17:17:42] jynus: ??? 
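[editor's note] Why `wiki@phtml` used to resolve before the "Escape period in wiki.phtml rewrites" patch: an unescaped `.` in a regular expression matches any character. The sketch below uses hypothetical patterns, since the actual rewrite rules are not quoted in the log.

```ruby
# Hypothetical patterns; the real rewrite rules are not shown in the log.
UNESCAPED = %r{\A/w/wiki.phtml\z}   # "." matches any single character
ESCAPED   = %r{\A/w/wiki\.phtml\z}  # "\." matches only a literal period

# True when the whole request path matches the pattern.
def matches?(pattern, path)
  pattern.match?(path)
end

puts matches?(UNESCAPED, '/w/wiki@phtml') # true: the bug
puts matches?(ESCAPED,   '/w/wiki@phtml') # false: falls through to a 404
puts matches?(ESCAPED,   '/w/wiki.phtml') # true: the intended URL still works
```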
[17:17:53] maybe I missinterpreted you [17:18:08] and you were tacking about something else [17:18:20] that is why I asked [17:18:44] *talking [17:19:06] <_joe_> jynus: I do my own rudimentary web regression testing with apache-fast-test [17:19:17] <_joe_> and have some software written to that end [17:19:35] ah <3 [17:19:35] (03CR) 10Elukey: [C: 032] Fix retry_interval for check_leaked_hhvm_threads [puppet] - 10https://gerrit.wikimedia.org/r/336438 (owner: 10Elukey) [17:20:19] <_joe_> jynus: nowhere near finished [17:20:56] <_joe_> it's rotting on my computer since the HHVM ages [17:21:47] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:05] <_joe_> Reedy: should be done [17:22:08] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3006166 (10Marostegui) Thanks for the heads up! I will get the server ready by Thursday then! Thank you! [17:22:28] _joe_: LGTM, thanks [17:22:46] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:32] (03PS1) 10Ema: varnish: warm-up before switching to new VCL in reload script [puppet] - 10https://gerrit.wikimedia.org/r/336440 (https://phabricator.wikimedia.org/T157430) [17:25:21] <_joe_> ema: is that used by confd as well? [17:25:32] _joe_: it is [17:25:47] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:25:47] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:25:48] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3006176 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db2036.codfw.wmnet'] ``` The log can be found in `/var/log/... [17:25:52] <_joe_> ema: that looks like a quality fix :P [17:26:02] <_joe_> but hey, if it works... [17:26:05] heh :) [17:26:20] (03PS3) 10Giuseppe Lavagetto: Swap ori's `mw` script to using proper entry point [puppet] - 10https://gerrit.wikimedia.org/r/332648 (owner: 10Chad) [17:26:33] _joe_, it is called "agile system administration" [17:26:38] _joe_: yeah it seems to work fine and doesn't have obvious drawbacks [17:26:46] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:27:06] <_joe_> jynus: I thought "duct tape system engineering" is not insulting and more accurate [17:27:30] <_joe_> ema: ack, I'm all for hacks once the root cause is understood and known upstream [17:27:43] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Swap ori's `mw` script to using proper entry point [puppet] - 10https://gerrit.wikimedia.org/r/332648 (owner: 10Chad) [17:27:45] _joe_: nah, it's called a "packaging concern" https://github.com/varnishcache/varnish-cache/issues/2195#issuecomment-274319758 [17:28:11] <_joe_> RainbowSprinkles ori ^^ FYI, it's merged [17:28:21] thanks [17:28:25] Coolio thx [17:28:43] I'll wait awhile before killing the back-compat, let it work its way everywhere first [17:30:04] <_joe_> 1 hour tops :P [17:30:16] Yeah no rush, I'll do it sometime this afternoon [17:31:51] (03CR) 10Madhuvishy: [C: 031] Remove madhuvishy from statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/335787 (https://phabricator.wikimedia.org/T142836) (owner: 10Muehlenhoff) [17:33:46] <_joe_> puppet swat officially done [17:34:02] ema: 
we could make it _joe_-approved by creating a hieradata variable for the varnish probe time, and pulling that down through roles and profiles into the probe VCL and that sleep value so they change together :) [17:34:32] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1022 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336442 [17:39:31] (03CR) 10EBernhardson: [C: 031] "Configuration looks to all line up with the code. The dependent patch was shipped to prod in 1.29.0-wmf.10 so is live everywhere." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) (owner: 10Tjones) [17:39:38] PROBLEM - puppet last run on analytics1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:39:50] <_joe_> bblack: PRRRR [17:39:55] better than putting it into etcd! :) [17:42:15] (03PS1) 10Milimetric: Enable Dashiki extension on meta.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336444 (https://phabricator.wikimedia.org/T156971) [17:43:20] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1022 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336442 (owner: 10Jcrespo) [17:45:32] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3006247 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2001.codfw.wmnet'] ``` and were **ALL** successful. 
[17:45:39] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1022 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336442 (owner: 10Jcrespo) [17:47:07] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1022 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336442 (owner: 10Jcrespo) [17:47:58] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3006248 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2036.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['db2036.codfw.wmnet']) ``` [17:49:19] (03PS2) 10Ema: varnish: warm-up before switching to new VCL in reload script [puppet] - 10https://gerrit.wikimedia.org/r/336440 (https://phabricator.wikimedia.org/T157430) [17:49:26] (03CR) 10Ema: [V: 032 C: 032] varnish: warm-up before switching to new VCL in reload script [puppet] - 10https://gerrit.wikimedia.org/r/336440 (https://phabricator.wikimedia.org/T157430) (owner: 10Ema) [17:49:35] Dereckson: Should #patch-for-review be removed from https://phabricator.wikimedia.org/T157387 ? [17:50:18] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2001.codfw.wmnet [17:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:15] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3006253 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2002.codfw.wmnet'] ```... 
[17:52:13] strange binlog_disk_usage is high on db1022 [17:52:48] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:57:04] it is ok now [17:58:52] (03PS1) 10Milimetric: Remove labs-specific enable of Dashiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336446 [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170207T1800). [18:00:17] sooon [18:00:21] but not today [18:00:33] (03CR) 10Milimetric: "The next change, https://gerrit.wikimedia.org/r/#/c/336446/, fixes the beta config of this extension." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336444 (https://phabricator.wikimedia.org/T156971) (owner: 10Milimetric) [18:08:34] RECOVERY - puppet last run on analytics1046 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:09:30] !log arlolra@tin Started deploy [parsoid/deploy@c3a5c55]: Updating Parsoid to f0732260 [18:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:32] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3006328 (10APalmer_WMF) Hi @Dzahn, looks like we had the wrong username. Are you able to find "raqstallman"? Thanks so much for taking care of this. [18:13:33] 06Operations, 10Traffic, 13Patch-For-Review: varnish-be returning 503s upon VCL reload - https://phabricator.wikimedia.org/T157430#3006343 (10ema) 05Open>03Resolved a:03ema Tested https://gerrit.wikimedia.org/r/336440 on a esams host, no 503 spikes. Closing. 
[18:18:35] !log arlolra@tin Finished deploy [parsoid/deploy@c3a5c55]: Updating Parsoid to f0732260 (duration: 09m 05s) [18:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:05] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1022 after maintenance (duration: 00m 40s) [18:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:06] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3006365 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2002.codfw.wmnet'] ``` and were **ALL** successful. [18:23:33] (03CR) 10EBernhardson: Enable Translation memories multi-DC support (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [18:24:41] !log Updated Parsoid to f0732260 (T109897) [18:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:45] T109897: Table parsing diffs: Parsoid adds implicit s after a |- if explicit pipe is not present - https://phabricator.wikimedia.org/T109897 [18:29:58] (03PS1) 10ArielGlenn: cleanup old files after dataset100 rsync of dumps to labs [puppet] - 10https://gerrit.wikimedia.org/r/336451 [18:33:18] (03PS1) 10Jcrespo: mariadb: depool db1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336453 (https://phabricator.wikimedia.org/T152188) [18:33:26] !log starting branch cut for MediaWiki and extensions 1.29.0-wmf.11 [18:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:29] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:42:53] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917#3006444 (10JKatzWMF) [18:43:06] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops: upgrade backup4001 hard disk array - https://phabricator.wikimedia.org/T157473#3006447 (10RobH) [18:43:10] (03PS2) 10Jcrespo: mariadb: depool db1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336453 (https://phabricator.wikimedia.org/T152188) [18:44:57] (03PS3) 10Jcrespo: mariadb: depool db1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336453 (https://phabricator.wikimedia.org/T152188) [18:45:37] (03CR) 10jerkins-bot: [V: 04-1] cleanup old files after dataset100 rsync of dumps to labs [puppet] - 10https://gerrit.wikimedia.org/r/336451 (owner: 10ArielGlenn) [18:50:23] (03CR) 10Dzahn: "check that all the @wikimedia.org email addresses work:" [puppet] - 10https://gerrit.wikimedia.org/r/336417 (owner: 10Muehlenhoff) [18:52:01] (03CR) 10Jcrespo: [C: 032] mariadb: depool db1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336453 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [18:53:29] (03Merged) 10jenkins-bot: mariadb: depool db1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336453 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [18:53:42] (03CR) 10jenkins-bot: mariadb: depool db1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336453 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [18:55:00] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1015 after maintenance (duration: 00m 54s) [18:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:32] (03PS1) 10Ottomata: Reset refined webrequest retention back to 62 days 
[puppet] - 10https://gerrit.wikimedia.org/r/336457 [18:55:55] !log restarting and upgrading db1015 T152188 [18:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:59] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [18:56:36] (03Abandoned) 10Ottomata: Reset refined webrequest retention back to 62 days [puppet] - 10https://gerrit.wikimedia.org/r/336457 (owner: 10Ottomata) [18:56:39] (03PS2) 10Dzahn: More email addresses of WMF staff/contractors with LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/336417 (https://phabricator.wikimedia.org/T142826) (owner: 10Muehlenhoff) [18:56:55] (03PS1) 10Ottomata: Reset refined webrequest retention back to 62 days [puppet] - 10https://gerrit.wikimedia.org/r/336458 [18:58:18] !log preparing db2043 for reimage T152188 [18:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:40] 06Operations, 10Collection, 10Traffic, 07HTTPS, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3006606 (10Ckepper) Thanks for the heads up. If there is no other option, we can get a wildcard SSL cert for tools.pediapress.com. [19:05:32] (03CR) 10Dzahn: "add: LDAP user "raqstallman" with email rstallman@ (because i just added the user to WMF group as requested on T140380)" [puppet] - 10https://gerrit.wikimedia.org/r/336417 (https://phabricator.wikimedia.org/T142826) (owner: 10Muehlenhoff) [19:06:01] (03CR) 10Ottomata: [C: 032] Reset refined webrequest retention back to 62 days [puppet] - 10https://gerrit.wikimedia.org/r/336458 (owner: 10Ottomata) [19:08:58] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3006636 (10Dzahn) Hi @APalmer_WMF Yes, i can find that. Added to "wmf". The login should work now. 
[19:09:29] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:09:39] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3006647 (10APalmer_WMF) Thanks so much, @Dzahn. Really appreciate it! [19:10:04] 06Operations, 10LDAP-Access-Requests: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380#3006652 (10APalmer_WMF) 05Open>03Resolved [19:11:30] 06Operations, 10Collection, 10Traffic, 07HTTPS, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3006662 (10Dzahn) @Ckepper How about using https://letsencrypt.org/ it's easy with https://certbot.eff.org/ , you don't have to spend an... [19:14:48] (03PS1) 10Jcrespo: Revert "mariadb: depool db1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336460 [19:17:48] (03PS3) 10Dzahn: More email addresses of WMF staff/contractors with LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/336417 (https://phabricator.wikimedia.org/T142826) (owner: 10Muehlenhoff) [19:23:29] (03PS1) 10Chad: multiversion: Drop remaining MWVersion.php shim [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336461 [19:23:36] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: depool db1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336460 (owner: 10Jcrespo) [19:25:26] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336460 (owner: 10Jcrespo) [19:27:20] (03CR) 10jenkins-bot: Revert "mariadb: depool db1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336460 (owner: 10Jcrespo) [19:30:00] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1015 after maintenance (duration: 01m 34s) [19:30:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:48] !log drain shards from elastic200[34] in preparation for reimage - T151326 [19:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:52] T151326: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326 [19:37:57] (03PS6) 10Dzahn: wikistats: cron for automatic miraheze table update (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) [19:39:20] (03CR) 10jerkins-bot: [V: 04-1] wikistats: cron for automatic miraheze table update (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) (owner: 10Dzahn) [19:39:57] (03PS6) 10Dzahn: puppet/puppet_compiler: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334307 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:40:45] (03CR) 10Dzahn: [C: 032] "it's all in "puppet::self" and tests, and also http://puppet-compiler.wmflabs.org/5364/" [puppet] - 10https://gerrit.wikimedia.org/r/334307 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:41:39] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:43:10] !log deploying analysis-stempel plugin on relforge and cluster restart [19:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:37] (03CR) 10Dzahn: "could you please separate the "interface" changes into a separate change" [puppet] - 10https://gerrit.wikimedia.org/r/334320 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:48:50] (03CR) 10Dzahn: "could you please separate bastionhost and authdns" [puppet] - 10https://gerrit.wikimedia.org/r/334310 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:52:17] (03CR) 10Dzahn: [C: 031] "compiler finished http://puppet-compiler.wmflabs.org/5363/" [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:52:29] 06Operations, 10Collection, 10Traffic, 07HTTPS, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3006787 (10Ckepper) Thank you @Dzahn, that's an excellent suggestion. I will look into it and try to set it up for tools.pediapress.com. [20:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170207T2000). Please do the needful. 
[20:00:42] * thcipriani does needful [20:07:47] (03PS1) 10Andrew Bogott: Add a bunch of dummy files for service-dev secrets [labs/private] - 10https://gerrit.wikimedia.org/r/336462 [20:07:50] (03PS1) 10Andrew Bogott: Add dummy passwords::mirrors [labs/private] - 10https://gerrit.wikimedia.org/r/336463 [20:08:01] mutante: ^ should fix sodium and restbase-dev [20:08:29] mayb [20:08:30] e [20:08:55] awesome, i was staring at that list too :) [20:09:02] thank you for that [20:09:32] (03CR) 10Dzahn: [C: 032] Add dummy passwords::mirrors [labs/private] - 10https://gerrit.wikimedia.org/r/336463 (owner: 10Andrew Bogott) [20:09:39] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:10:02] !log thcipriani@tin Started scap: testwiki to php-1.29.0-wmf.11 and rebuild l10n cache [20:10:03] (03CR) 10Dzahn: [V: 032 C: 032] Add dummy passwords::mirrors [labs/private] - 10https://gerrit.wikimedia.org/r/336463 (owner: 10Andrew Bogott) [20:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:01] (03CR) 10Dzahn: [C: 031] Add a bunch of dummy files for service-dev secrets [labs/private] - 10https://gerrit.wikimedia.org/r/336462 (owner: 10Andrew Bogott) [20:14:31] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add a bunch of dummy files for service-dev secrets [labs/private] - 10https://gerrit.wikimedia.org/r/336462 (owner: 10Andrew Bogott) [20:14:53] (03PS1) 10Andrew Bogott: Add docker/registry.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/336465 [20:15:24] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add docker/registry.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/336465 (owner: 10Andrew Bogott) [20:16:37] (03PS1) 10Jdlrobson: Update footer logos on mobile site for various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336466 (https://phabricator.wikimedia.org/T157476) [20:23:04] (03PS7) 10Dzahn: wikistats: cron for automatic miraheze table update (WIP) 
[puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) [20:23:22] (03PS2) 10Jdlrobson: Update footer logos on mobile site for various projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336466 (https://phabricator.wikimedia.org/T157476) [20:25:13] (03PS6) 10Hashar: rsync: allow extra settings in rsyncd.conf [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) [20:26:48] (03PS8) 10Dzahn: wikistats: cron for automatic miraheze table update (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) [20:27:44] (03CR) 10jerkins-bot: [V: 04-1] wikistats: cron for automatic miraheze table update (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) (owner: 10Dzahn) [20:28:58] (03PS9) 10Dzahn: wikistats: cron for automatic miraheze table update (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) [20:29:31] (03CR) 10Hashar: "I did the magic command:" [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [20:29:47] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3006952 (10Samtar) [20:29:48] (03PS10) 10Dzahn: wikistats: cron for automatic miraheze table update (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) [20:30:04] (03PS11) 10Dzahn: wikistats: cron for automatic miraheze table update [puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) [20:30:09] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:30:21] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3006968 (10Samtar) [20:30:46] Improperly owned (0:0) files in /srv/mediawiki-staging [20:30:50] looks [20:32:16] Files ownership is ok. [20:33:00] (03PS6) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [20:33:13] hrm? I'm running train, any weird ownership thing may be me? [20:33:16] (03CR) 10Dzahn: [C: 032] wikistats: cron for automatic miraheze table update [puppet] - 10https://gerrit.wikimedia.org/r/335971 (https://phabricator.wikimedia.org/T153930) (owner: 10Dzahn) [20:33:20] (03PS7) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [20:34:29] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 (owner: 10Rush) [20:34:37] awww, really.. the error on my labs instance is back [20:34:41] File[/home/dzahn/.puppet/ssl]: change from 0755 to 0771 failed: failed to set mode 755 [20:34:41] (03CR) 10Rush: "@andrew the idea is to run this as a service on labnet constantly validating the pipeline at $interval(as well as tracking heurtistics for" [puppet] - 10https://gerrit.wikimedia.org/r/336413 (owner: 10Rush) [20:35:21] sigh.. what is this from.. 
it just happened yesterday and then went away when i fixed the config in horizon [20:35:52] Error: Could not set 'directory' on ensure: Permission denied @ dir_s_mkdir - /home/dzahn/.puppet/var [20:35:55] Error: Could not set 'directory' on ensure: Permission denied @ dir_s_mkdir - /home/dzahn/.puppet/var [20:35:58] Wrapped exception: [20:36:02] yea, but nobody said you should try creating that [20:36:07] (03PS8) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [20:38:53] (03PS3) 10Rush: nodepool: active check for node pool instance states [puppet] - 10https://gerrit.wikimedia.org/r/336404 [20:39:10] (03PS9) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [20:39:20] (03PS1) 10Andrew Bogott: Restbase secrets: Add more dummy files, hopefully all of them this time [labs/private] - 10https://gerrit.wikimedia.org/r/336472 [20:40:19] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:40:24] (03CR) 10Andrew Bogott: [V: 032 C: 032] Restbase secrets: Add more dummy files, hopefully all of them this time [labs/private] - 10https://gerrit.wikimedia.org/r/336472 (owner: 10Andrew Bogott) [20:46:09] (03PS1) 10Andrew Bogott: Move a misplaced dummy hiera file [labs/private] - 10https://gerrit.wikimedia.org/r/336474 [20:46:38] (03CR) 10Andrew Bogott: [V: 032 C: 032] Move a misplaced dummy hiera file [labs/private] - 10https://gerrit.wikimedia.org/r/336474 (owner: 10Andrew Bogott) [20:48:50] (03CR) 10Hashar: [C: 031] "I am all for it! 
Thanks :]" [puppet] - 10https://gerrit.wikimedia.org/r/336404 (owner: 10Rush) [20:52:37] (03PS1) 10Andrew Bogott: Add more dummy private hiera settings for swift [labs/private] - 10https://gerrit.wikimedia.org/r/336478 [20:53:08] (03CR) 10Andrew Bogott: [V: 032 C: 032] Add more dummy private hiera settings for swift [labs/private] - 10https://gerrit.wikimedia.org/r/336478 (owner: 10Andrew Bogott) [20:54:50] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:54:59] PROBLEM - Nginx local proxy to apache on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:19] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:57:09] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:00:51] (03CR) 10Andrew Bogott: "This really ought to be a no-op, verifying with https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/5370/console" [puppet] - 10https://gerrit.wikimedia.org/r/335869 (https://phabricator.wikimedia.org/T149589) (owner: 10Andrew Bogott) [21:01:23] (03PS8) 10Andrew Bogott: Linting fixes (Multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [21:01:56] !log thcipriani@tin Finished scap: testwiki to php-1.29.0-wmf.11 and rebuild l10n cache (duration: 51m 53s) [21:02:01] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3007181 (10Samtar) Just to note I've signed L3 :-) [21:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:41] thcipriani: please add the release notes [21:05:12] matanya: yup. Will do. Just need to finish group0 stuff then they'll be up. 
[21:07:39] (03PS1) 10Thcipriani: Group0 to 1.29.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336482 [21:08:19] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [21:09:47] (03CR) 10Andrew Bogott: [C: 032] Linting fixes (Multiple modules) [puppet] - 10https://gerrit.wikimedia.org/r/334276 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [21:12:16] matanya: added https://www.mediawiki.org/wiki/MediaWiki_1.29/wmf.11 [21:12:26] thanks much [21:13:15] absotively. thanks for keeping an eye out :) [21:13:43] (03CR) 10Thcipriani: [C: 032] Group0 to 1.29.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336482 (owner: 10Thcipriani) [21:16:40] (03Merged) 10jenkins-bot: Group0 to 1.29.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336482 (owner: 10Thcipriani) [21:17:10] (03CR) 10jenkins-bot: Group0 to 1.29.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336482 (owner: 10Thcipriani) [21:17:59] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.29.0-wmf.11 [21:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:15] (03PS1) 10Yuvipanda: docker: Turn on live-restore for docker builder host on prod [puppet] - 10https://gerrit.wikimedia.org/r/336487 (https://phabricator.wikimedia.org/T157180) [21:24:09] (03PS2) 10Yuvipanda: docker: Turn on live-restore for docker builder host on prod [puppet] - 10https://gerrit.wikimedia.org/r/336487 (https://phabricator.wikimedia.org/T157180) [21:25:03] (03CR) 10Yuvipanda: [V: 032 C: 032] docker: Turn on live-restore for docker builder host on prod [puppet] - 10https://gerrit.wikimedia.org/r/336487 (https://phabricator.wikimedia.org/T157180) (owner: 10Yuvipanda) [21:26:33] (03PS1) 10Yuvipanda: docker: Unbreak puppet on copper [puppet] - 10https://gerrit.wikimedia.org/r/336489 [21:26:57] (03PS2) 10Yuvipanda: docker: Unbreak 
puppet on copper [puppet] - 10https://gerrit.wikimedia.org/r/336489 [21:27:02] (03CR) 10Yuvipanda: [V: 032 C: 032] docker: Unbreak puppet on copper [puppet] - 10https://gerrit.wikimedia.org/r/336489 (owner: 10Yuvipanda) [21:28:20] mutante: volans copper fixed [21:28:23] thanks for pointing it out! [21:28:24] brb [21:28:28] yuvipanda: :) thanks [21:28:39] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:28:47] mutante: is this all it takes to setup an nrpe check? https://gerrit.wikimedia.org/r/#/c/336404/3/modules/role/manifests/labs/openstack/nodepool.pp [21:35:11] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3007368 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db2043.codfw.wmnet'] ``` The log can be found in `/var/log/... [21:38:05] chasemp: there's one other change you need, let me find an example [21:38:23] thanks, I thought so but couldn't figure for some reason [21:39:25] (03PS2) 10Rush: Remove views.pp from labsdb role, duplicate of labs::db::views [puppet] - 10https://gerrit.wikimedia.org/r/324889 (owner: 10Jcrespo) [21:39:39] chasemp: I think this is the bit you're missing [21:39:40] https://gerrit.wikimedia.org/r/#/c/326845/ [21:40:14] (defining the check on the monitoring host as well as on the monitored host) [21:40:21] hm ok [21:47:58] (03PS5) 10Rush: mwyaml: Accept existing, but empty "Hiera:" pages as well [puppet] - 10https://gerrit.wikimedia.org/r/325131 (https://phabricator.wikimedia.org/T152142) (owner: 10Tim Landscheidt) [21:48:31] (03CR) 10Andrew Bogott: [C: 032] "Hm, I guess that would be Bryan, anyway." 
[puppet] - 10https://gerrit.wikimedia.org/r/329020 (https://phabricator.wikimedia.org/T151422) (owner: 10BryanDavis) [21:48:40] (03PS2) 10Andrew Bogott: logstash: Add a json_lines TCP input [puppet] - 10https://gerrit.wikimedia.org/r/329020 (https://phabricator.wikimedia.org/T151422) (owner: 10BryanDavis) [21:49:51] (03CR) 10Rush: [C: 032] mwyaml: Accept existing, but empty "Hiera:" pages as well [puppet] - 10https://gerrit.wikimedia.org/r/325131 (https://phabricator.wikimedia.org/T152142) (owner: 10Tim Landscheidt) [21:50:41] (03PS6) 10Andrew Bogott: mwyaml: Accept existing, but empty "Hiera:" pages as well [puppet] - 10https://gerrit.wikimedia.org/r/325131 (https://phabricator.wikimedia.org/T152142) (owner: 10Tim Landscheidt) [21:51:27] chasemp: i checked for some existing NRPE checks on mailman. i just have a simple file{} that installs the plugin locally on the host [21:51:41] and it goes to file { '/usr/local/lib/nagios/plugins/check_mailman_queue': [21:52:26] and then the nrpe::monitor_service that uses it [21:53:06] ./puppet$ grep -r check_mailman_q * [21:55:26] (03CR) 10Rush: [C: 032] Remove views.pp from labsdb role, duplicate of labs::db::views [puppet] - 10https://gerrit.wikimedia.org/r/324889 (owner: 10Jcrespo) [21:55:34] (03PS3) 10Rush: Remove views.pp from labsdb role, duplicate of labs::db::views [puppet] - 10https://gerrit.wikimedia.org/r/324889 (owner: 10Jcrespo) [21:55:43] (03CR) 10Rush: [V: 032 C: 032] Remove views.pp from labsdb role, duplicate of labs::db::views [puppet] - 10https://gerrit.wikimedia.org/r/324889 (owner: 10Jcrespo) [21:58:11] (03CR) 10Rush: "Does inline query for this add much delay? Is it a 10ms operations or a 10s one? etc. If 10ms or the like seems good to me" [puppet] - 10https://gerrit.wikimedia.org/r/328030 (owner: 10Tim Landscheidt) [21:59:49] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:00:49] 07Puppet, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: mwyaml chokes on existing, but empty Hiera: pages on wikitech - https://phabricator.wikimedia.org/T152142#3007439 (10scfc) 05Open>03Resolved [22:04:06] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3007460 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2043.codfw.wmnet'] ``` and were **ALL** successful. [22:05:39] (03CR) 10Andrew Bogott: [C: 031] "Looks good! I look forward to having this page me :/" [puppet] - 10https://gerrit.wikimedia.org/r/336413 (owner: 10Rush) [22:11:32] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3007473 (10EBernhardson) Additional notes from the elasticsearch migration plugin: * Node attributes move to attr namespace ** node.ra... [22:12:59] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3006952 (10RobH) I'm on ops clinic duty this week, so I'll process this request on the ops side of things. @Samtar: @Samtar: Would you be able to review https://wikitech.wikimedia.... [22:15:56] (03PS1) 10Rush: tools: add bzip2 to list of throttled tools [puppet] - 10https://gerrit.wikimedia.org/r/336544 [22:16:57] (03CR) 10Madhuvishy: [C: 031] tools: add bzip2 to list of throttled tools [puppet] - 10https://gerrit.wikimedia.org/r/336544 (owner: 10Rush) [22:18:41] 06Operations, 10ops-codfw, 06DC-Ops: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#3007506 (10RobH) 05Open>03Resolved [22:22:41] (03CR) 10Rush: [V: 032 C: 032] "where are you jenkins?" 
[puppet] - 10https://gerrit.wikimedia.org/r/336544 (owner: 10Rush) [22:22:58] 06Operations, 10Collection, 10Traffic, 07HTTPS, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3007531 (10Platonides) Only the last patch depend on the pediapress changing their certificate. The other hosts have a valid certificate... [22:26:31] 06Operations, 10ops-codfw, 06DC-Ops: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#3007538 (10Dzahn) >>! In T110421#2412741, @Papaul wrote: > @RobH I have no access to librenms yet. @Papaul I looked at that and i see in puppet code that it allows login to members of the "ops" LDAP... [22:27:49] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:28:28] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3006952 (10Platonides) The same key in OpenSSH format: $ ssh-keygen -f STarling -i ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAthC8yN9ImF+F6DQsI4GqYdAKhEtwfZ/+S7xBg2V5Kz5LLrN/KWUN9uiKsU... 
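Note on the key conversion quoted just above: `ssh-keygen -i -f <file>` reads a public key in a foreign format (RFC 4716 "SSH2" by default, which is what PuTTYgen typically saves) and prints it in OpenSSH one-line form. A minimal sketch of that round trip follows, assuming a stock OpenSSH `ssh-keygen`. The throwaway key and the file names under `$workdir` are hypothetical and are not the key from the ticket.

```shell
#!/bin/sh
# Sketch: round-trip a public key through RFC 4716 format and back.
# All paths and the generated key are illustrative assumptions.
set -eu

workdir=$(mktemp -d)

# Generate a throwaway RSA key pair with no passphrase.
ssh-keygen -q -t rsa -b 2048 -N '' -C 'example' -f "$workdir/key"

# Export the public key to RFC 4716 ("SSH2") format, the multi-line
# form delimited by BEGIN/END SSH2 PUBLIC KEY markers.
ssh-keygen -e -f "$workdir/key.pub" > "$workdir/key.rfc4716"

# Import it again: -i converts the foreign format to the OpenSSH
# one-line form ("ssh-rsa AAAA...").
ssh-keygen -i -f "$workdir/key.rfc4716" > "$workdir/converted.pub"

# The comment field is not preserved by the import, so compare only
# the key type and the base64 blob.
original=$(cut -d' ' -f1,2 "$workdir/key.pub")
converted=$(cut -d' ' -f1,2 "$workdir/converted.pub")
```

If the conversion worked, `$original` and `$converted` hold the same key type and base64 blob. The converted line is the form an `authorized_keys` file expects.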
[22:28:46] !log otto@tin Started deploy [eventstreams/deploy@e86077c]: (no justification provided)
[22:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:12] !log otto@tin Finished deploy [eventstreams/deploy@e86077c]: (no justification provided) (duration: 02m 26s)
[22:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:29] Operations: fix log reading permissions for dc-ops admin group - https://phabricator.wikimedia.org/T156529#3007578 (Dzahn)
[22:34:00] Operations, Puppet, puppet-compiler: A few hosts never get clean puppet compiler runs - https://phabricator.wikimedia.org/T157496#3007582 (Dzahn)
[22:34:16] Operations, Puppet, puppet-compiler: A few hosts never get clean puppet compiler runs - https://phabricator.wikimedia.org/T157496#3007289 (Dzahn) p:Triage>Normal
[22:35:04] Operations, Puppet, puppet-compiler: A few hosts never get clean puppet compiler runs - https://phabricator.wikimedia.org/T157496#3007289 (Dzahn) I think all 4 hosts are due to the same issue with the prometheus template.
[22:35:16] Operations, Puppet, puppet-compiler, Prometheus-metrics-monitoring: A few hosts never get clean puppet compiler runs - https://phabricator.wikimedia.org/T157496#3007587 (Dzahn)
[22:36:15] 8
[22:49:19] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:51:35] Operations, Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3007630 (Samwalton9) @RobH no problems from me! I'm a part-time contractor so consider this endorsed if I'm able to do so, but if not @Ocaasi_WMF should be happy to.
[22:52:34] Operations, Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3007634 (RobH) >>! In T157483#3007630, @Samwalton9 wrote: > @RobH no problems from me! I'm a part-time contractor so consider this endorsed if I'm able to do so, but if not @Ocaasi_...
[22:56:54] (CR) Andrew Bogott: [C: 2] "Tedious puppet compiler run confirms that this is a no-op. Which, really, it had ought to be." [puppet] - https://gerrit.wikimedia.org/r/335869 (https://phabricator.wikimedia.org/T149589) (owner: Andrew Bogott)
[22:57:01] (PS2) Andrew Bogott: Add a bunch of filtertags to puppet class comments [puppet] - https://gerrit.wikimedia.org/r/335869 (https://phabricator.wikimedia.org/T149589)
[23:01:39] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:01:57] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3007662 (EBernhardson) Migration plugin also reports another error which I'm working on reproducing: * Index settings ** Built-in si...
[23:04:36] (PS1) Kaldari: Enable Echo on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105)
[23:15:29] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:16:29] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:18:19] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[23:28:35] (PS2) Dzahn: Block IPs for recent attempts to upload offtopic files to Phabricator [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper)
[23:29:39] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[23:31:22] (PS3) Dzahn: Block IPs for recent attempts to upload offtopic files to Phabricator [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper)
[23:32:35] (PS4) Dzahn: phabricator: Block IPs for recent attempts to upload offtopic files [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper)
[23:32:49] PROBLEM - puppet last run on wtp1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:42:59] !log carbon - stopping DHCP service (install* should be used)
[23:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:29] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[23:44:29] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[23:46:38] (PS1) Platonides: Increase account creation limit for a couple of schools on it.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/336552 (https://phabricator.wikimedia.org/T157504)
[23:51:49] (PS1) Platonides: Enable the "Quiz" extension on Spanish Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/336553 (https://phabricator.wikimedia.org/T157513)
[23:54:49] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:59:49] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures