[00:42:49] Operations, Commons, Multimedia, Traffic, and 3 others: Disable serving unpatrolled new files to Wikipedia Zero users - https://phabricator.wikimedia.org/T167400#3787682 (Tgr) So to summarize / spec out the code parts a bit: == Core functionality: * Make `File::getContentHeaders()` look up patro...
[00:56:04] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 60522 MB (12% inode=99%)
[01:03:04] RECOVERY - Disk space on elastic1020 is OK: DISK OK
[01:14:49] Krenair: The database servers for Wikidata on ToolForge are slower
[01:53:23] Krenair: This shouldn't take 40 seconds execute, it's an INDEX lookup.
[01:53:24] SELECT 1 FROM wikidatawiki_p.wb_items_per_site WHERE ips_site_page='Harrington Place, Stellenbosch' AND ips_site_id != 'enwiki';
[01:55:14] mysql:wikiadmin@db1106 [wikidatawiki]> SELECT 1 FROM wb_items_per_site WHERE ips_site_page='Harrington Place, Stellenbosch' AND ips_site_id != 'enwiki';
[01:55:14] Empty set (17.92 sec)
[01:55:18] That's pretty bad
[02:22:57] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.8) (duration: 05m 44s)
[02:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:42] Operations, Release Pipeline, Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3787703 (thcipriani) >>! In T179984#3784256, @akosiaris wrote: > I am guessing we can resolve this, but if you have any info about the version pu...
[02:41:24] PROBLEM - Nginx local proxy to apache on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:42:14] RECOVERY - Nginx local proxy to apache on mw2251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.194 second response time
[02:57:24] Operations, Release Pipeline, Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3787716 (thcipriani)
[03:12:26] https://en.wikipedia.org/wiki/User:LinkBot/suggestions/Actor Linked to Paula_Garc▒s (invalid UTF-8 sequence). I fixed it with a null edit.
[03:24:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 795.90 seconds
[03:34:15] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused
[03:34:35] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused
[03:36:35] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.024 second response time
[03:38:15] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.004 second response time
[03:57:04] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 274.75 seconds
[04:07:43] (CR) BryanDavis: user homes: Allow git to control +x for $HOME files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377056 (owner: BryanDavis)
[04:33:55] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2097201
[06:22:09] (PS1) Marostegui: db-eqiad.php: Pool db1097 in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/393522 (https://phabricator.wikimedia.org/T178359)
[06:22:17] Operations, DBA, Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3787772 (Marostegui) I am now online
[06:23:52] 
(CR) Marostegui: [C: 2] db-eqiad.php: Pool db1097 in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/393522 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:25:22] (Merged) jenkins-bot: db-eqiad.php: Pool db1097 in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/393522 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:26:56] (CR) jenkins-bot: db-eqiad.php: Pool db1097 in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/393522 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:27:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1097:3314 with low weight - T178359 (duration: 00m 46s)
[06:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:22] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:28:25] PROBLEM - puppet last run on db2090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py]
[06:29:14] PROBLEM - puppet last run on mw2222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssh/userkeys/pybal-check]
[06:31:44] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ssh/userkeys/pybal-check]
[06:46:10] Operations, ops-eqiad, DBA, hardware-requests: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3787781 (Marostegui)
[06:47:14] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1021 [mediawiki-config] - https://gerrit.wikimedia.org/r/393524 (https://phabricator.wikimedia.org/T181378)
[06:49:29] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1021 [mediawiki-config] - https://gerrit.wikimedia.org/r/393524 (https://phabricator.wikimedia.org/T181378) (owner: Marostegui)
[06:50:49] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1021 [mediawiki-config] - https://gerrit.wikimedia.org/r/393524 (https://phabricator.wikimedia.org/T181378) (owner: Marostegui)
[06:50:59] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1021 [mediawiki-config] - https://gerrit.wikimedia.org/r/393524 (https://phabricator.wikimedia.org/T181378) (owner: Marostegui)
[06:52:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1021 from the config as it will be decommissioned - T181378 (duration: 00m 45s)
[06:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:08] T181378: Decommission db1021 - https://phabricator.wikimedia.org/T181378
[06:53:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1021 from the config as it will be decommissioned - T181378 (duration: 00m 44s)
[06:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:13] (PS1) Marostegui: mariadb: Prepare to decommission db1021 [puppet] - https://gerrit.wikimedia.org/r/393525 (https://phabricator.wikimedia.org/T181378)
[06:55:40] (PS2) Marostegui: mariadb: Prepare to decommission db1021 [puppet] - https://gerrit.wikimedia.org/r/393525 (https://phabricator.wikimedia.org/T181378)
[06:56:44] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures
[06:58:25] RECOVERY - puppet last run on db2090 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:05] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:03:23] (CR) Marostegui: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/8974/" [puppet] - https://gerrit.wikimedia.org/r/393525 (https://phabricator.wikimedia.org/T181378) (owner: Marostegui)
[07:05:02] (PS1) Marostegui: s2.hosts: Remove db1021 [software] - https://gerrit.wikimedia.org/r/393528 (https://phabricator.wikimedia.org/T181378)
[07:05:58] (CR) Marostegui: [C: 2] s2.hosts: Remove db1021 [software] - https://gerrit.wikimedia.org/r/393528 (https://phabricator.wikimedia.org/T181378) (owner: Marostegui)
[07:06:39] (Merged) jenkins-bot: s2.hosts: Remove db1021 [software] - https://gerrit.wikimedia.org/r/393528 (https://phabricator.wikimedia.org/T181378) (owner: Marostegui)
[07:09:37] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3787811 (Marostegui)
[07:10:32] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3787781 (Marostegui)
[07:10:42] !log Stop MySQL on db1021 as it will be decommissioned - T181378
[07:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:49] T181378: Decommission db1021 - https://phabricator.wikimedia.org/T181378
[07:13:23] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1021 - https://phabricator.wikimedia.org/T181378#3787814 (Marostegui) a:Marostegui>Cmjohnson This server is now ready for @Cmjohnson to take over and fully decommission it. Thanks!
[07:15:52] (PS1) Marostegui: db-eqiad.php: Increase weight for db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393533 (https://phabricator.wikimedia.org/T178359)
[07:18:17] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393533 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:19:29] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393533 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:19:39] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393533 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[07:20:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1097:3314 weight - T178359 (duration: 00m 45s)
[07:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:50] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[07:24:21] (PS1) Faidon Liambotis: Include geoip in kafkatee::webrequest::ops [puppet] - https://gerrit.wikimedia.org/r/393536
[07:24:44] (CR) jerkins-bot: [V: -1] Include geoip in kafkatee::webrequest::ops [puppet] - https://gerrit.wikimedia.org/r/393536 (owner: Faidon Liambotis)
[07:26:13] blergh
[07:27:33] _joe_: since you added that check... how am I supposed to work around this without rewriting the whole thing? :)
[07:28:07] I guess I could add a profile::geoip { include ::geoip } ?
[07:33:41] Operations, DBA, Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3787833 (alanajjar) @Marostegui I'll start now?
[07:33:57] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0
[07:35:58] Operations, DBA, Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3787835 (alanajjar) [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Rexcornot|The progress]] ..
[07:38:07] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[07:39:16] Operations, DBA, Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3787837 (Marostegui) Thanks!
[07:45:14] (PS1) Marostegui: db-eqiad.php: Increase weight for db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393543
[07:48:30] <_joe_> paravoid: either that or # lint:ignore wmf-styleguide
[07:57:30] Operations, DBA, Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3787839 (alanajjar) Done. 
Thanks @Marostegui
[07:57:50] Operations, DBA, Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3787840 (alanajjar) Open>Resolved
[08:02:19] <_joe_> paravoid: so looking at the code, the point is that role::logging::kafkatee::ops should really be a profile :)
[08:03:07] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:06:59] _joe_: yeah I know, but I'd rather do this simple thing instead of rewrite it all :P
[08:16:31] <_joe_> in those situations, it's ok to just ignore the -1 for now, or add a lint:ignore that could be removed once we migrate the class
[08:19:32] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393543 (owner: Marostegui)
[08:19:55] (CR) Muehlenhoff: [C: 2] grafana_http: Restrict to CACHE_MISC [puppet] - https://gerrit.wikimedia.org/r/393244 (owner: Muehlenhoff)
[08:20:01] (PS2) Muehlenhoff: grafana_http: Restrict to CACHE_MISC [puppet] - https://gerrit.wikimedia.org/r/393244
[08:20:51] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393543 (owner: Marostegui)
[08:21:00] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393543 (owner: Marostegui)
[08:22:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1097:3314 weight - T178359 (duration: 00m 45s)
[08:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:32] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:27:39] !log Deploy schema change on dbstore1002 and dbstore1001 - T174569
[08:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:46] T174569: 
Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[08:28:17] Operations, Release Pipeline, monitoring, Continuous-Integration-Infrastructure (shipyard), Release-Engineering-Team (Kanban): Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3787851 (hashar)
[08:29:01] Operations, Cloud-Services, monitoring, Continuous-Integration-Infrastructure (shipyard), and 3 others: Grafana reports ALL docker mounts in a spammy way - https://phabricator.wikimedia.org/T177052#3645705 (hashar)
[08:33:56] Operations, CheckUser, Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368#3787859 (ema) p:Triage>Normal
[08:34:20] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3787860 (ema) p:Triage>Normal
[08:34:30] Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#3787863 (nichtich)
[08:40:52] !log installing openjdk security updates on hadoop, druid and kafka clusters
[08:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:58] (PS1) Marostegui: db-eqiad.php: Fully pool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393545 (https://phabricator.wikimedia.org/T178359)
[08:44:28] PROBLEM - Apache HTTP on mw1324 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[08:45:28] RECOVERY - Apache HTTP on mw1324 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.028 second response time
[08:45:39] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3787868 (Samat) 
I re-uploaded the .har file. I will give access to it as soon as I will know who should I share with :)
[08:46:23] Operations, CheckUser, Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368#3787475 (ema) >>! In T181368#3787516, @Krenair wrote: > I would expect $_SERVER['REMOTE_PORT'] to be useless inside WMF infrastructure Yeah I'...
[08:46:29] (PS1) ArielGlenn: move one more setting out of snapshot hiera and into profiles [puppet] - https://gerrit.wikimedia.org/r/393546
[08:46:31] (PS1) ArielGlenn: move last hiera calls out of snapshot modules into profile [puppet] - https://gerrit.wikimedia.org/r/393547
[08:46:58] (CR) jerkins-bot: [V: -1] move one more setting out of snapshot hiera and into profiles [puppet] - https://gerrit.wikimedia.org/r/393546 (owner: ArielGlenn)
[08:47:11] (CR) jerkins-bot: [V: -1] move last hiera calls out of snapshot modules into profile [puppet] - https://gerrit.wikimedia.org/r/393547 (owner: ArielGlenn)
[08:48:49] (PS2) Faidon Liambotis: Include geoip in kafkatee::webrequest::ops [puppet] - https://gerrit.wikimedia.org/r/393536
[08:49:09] (CR) jerkins-bot: [V: -1] Include geoip in kafkatee::webrequest::ops [puppet] - https://gerrit.wikimedia.org/r/393536 (owner: Faidon Liambotis)
[08:50:02] grumble
[08:50:04] (PS3) Faidon Liambotis: Include geoip in kafkatee::webrequest::ops [puppet] - https://gerrit.wikimedia.org/r/393536
[08:50:24] (CR) jerkins-bot: [V: -1] Include geoip in kafkatee::webrequest::ops [puppet] - https://gerrit.wikimedia.org/r/393536 (owner: Faidon Liambotis)
[08:51:41] (PS4) Faidon Liambotis: Include geoip in kafkatee::webrequest::ops [puppet] - https://gerrit.wikimedia.org/r/393536
[08:52:08] yay
[08:52:10] (CR) Faidon Liambotis: [C: 2] Include geoip in kafkatee::webrequest::ops [puppet] - https://gerrit.wikimedia.org/r/393536 (owner: Faidon Liambotis)
[08:52:43] _joe_: btw, none of our CI, incl. the wmf styleguide, catches our whitespace recommendations
[08:52:47] (CR) Marostegui: [C: 2] db-eqiad.php: Fully pool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393545 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[08:52:50] the netbox stuff were merged with 2-spaces, I see now
[08:53:41] <_joe_> uh, in fact puppet-lint checks if the indentation is consistent, I think
[08:53:58] consistent, but it can be consistent with 3 spaces
[08:54:05] (Merged) jenkins-bot: db-eqiad.php: Fully pool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393545 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[08:54:09] <_joe_> paravoid: I think such a control can either be done via the puppet-lint config, or via a dedicated plugin
[08:54:18] (PS2) ArielGlenn: move one more setting out of snapshot hiera and into profiles [puppet] - https://gerrit.wikimedia.org/r/393546
[08:54:19] (PS2) ArielGlenn: move last hiera calls out of snapshot modules into profile [puppet] - https://gerrit.wikimedia.org/r/393547
[08:54:23] <_joe_> I'd have to check, for now I'm busy fixing all of our unit tests for puppet 4
[08:54:53] (CR) jerkins-bot: [V: -1] move last hiera calls out of snapshot modules into profile [puppet] - https://gerrit.wikimedia.org/r/393547 (owner: ArielGlenn)
[08:55:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully pool db1097:3314 - T178359 (duration: 00m 43s)
[08:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:25] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[08:56:08] Operations, CheckUser, Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368#3787475 (Legoktm) >>! In T181368#3787869, @ema wrote: >>>! 
In T181368#3787516, @Krenair wrote: >> I would expect $_SERVER['REMOTE_PORT'] to be...
[08:56:51] (CR) jenkins-bot: db-eqiad.php: Fully pool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/393545 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[08:58:59] (PS3) ArielGlenn: move last hiera calls out of snapshot modules into profile [puppet] - https://gerrit.wikimedia.org/r/393547
[09:05:08] (PS1) Filippo Giunchedi: cassandra: reprovision restbase1007 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/393550 (https://phabricator.wikimedia.org/T179422)
[09:08:22] Operations, ops-eqiad, DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3787886 (Marostegui) db1056 will be freed up during this week - we can use it to replace this host.
[09:08:23] (CR) Mobrovac: [C: 1] cassandra: reprovision restbase1007 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/393550 (https://phabricator.wikimedia.org/T179422) (owner: Filippo Giunchedi)
[09:08:26] (PS2) Filippo Giunchedi: cassandra: reprovision restbase1007 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/393550 (https://phabricator.wikimedia.org/T179422)
[09:08:58] (CR) Filippo Giunchedi: [C: 2] cassandra: reprovision restbase1007 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/393550 (https://phabricator.wikimedia.org/T179422) (owner: Filippo Giunchedi)
[09:14:14] !log reimage restbase1007 - T179422
[09:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:23] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[09:18:13] (PS4) Elukey: role::analytics_cluster::database::meta::backup*: move to profiles [puppet] - https://gerrit.wikimedia.org/r/393257 (https://phabricator.wikimedia.org/T167790)
[09:19:28] (PS3) DCausse: Upgrade logstash plugins to 5.5.2 [software/logstash/plugins] - 
https://gerrit.wikimedia.org/r/392621 (https://phabricator.wikimedia.org/T178412)
[09:29:07] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3786217 (Gilles) @Samat can you email the HAR file to performance-team@wikimedia.org ?
[09:32:39] Operations, Services, Graphite: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3787954 (fgiunchedi) p:Unbreak!>High This has been fixed temporarily by reverting to a previous version of service-runner, downgrading severity
[09:34:45] (CR) Elukey: [C: 2] role::analytics_cluster::database::meta::backup*: move to profiles [puppet] - https://gerrit.wikimedia.org/r/393257 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[09:39:08] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[09:41:25] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[09:41:30] !log ppchelko@tin Started deploy [cpjobqueue/deploy@b570d4e]: Make http agent use keep-alive
[09:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:18] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@b570d4e]: Make http agent use keep-alive (duration: 00m 48s)
[09:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:49] Operations, Services, Graphite, Wikimedia-Incident: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3787994 (fgiunchedi)
[09:47:32] Operations, Services, Graphite, Wikimedia-Incident: Alert on graphite UDP loss - https://phabricator.wikimedia.org/T181382#3788004 (fgiunchedi)
[09:47:47] (CR) DCausse: [C: 1] Add stemming languages settings for description indexing [mediawiki-config] - 
https://gerrit.wikimedia.org/r/392894 (https://phabricator.wikimedia.org/T176903) (owner: Smalyshev)
[09:57:59] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/393559 (https://phabricator.wikimedia.org/T128546)
[09:58:42] (CR) Gehel: [C: 1] "LGTM. We have a ton of free space on /var/lib, so no problem." [puppet] - https://gerrit.wikimedia.org/r/392603 (https://phabricator.wikimedia.org/T180051) (owner: DCausse)
[09:59:08] (CR) Gehel: [C: 1] "Note: template needs to be manually reloaded after this change is merged." [puppet] - https://gerrit.wikimedia.org/r/392603 (https://phabricator.wikimedia.org/T180051) (owner: DCausse)
[10:01:26] Operations, Cloud-Services, Structured-Data-Commons, Wikidata: Explore hosting the multimedia commons use case - https://phabricator.wikimedia.org/T152632#3788037 (SandraF_WMF)
[10:01:53] Operations: Integrate jessie 8.9 point release - https://phabricator.wikimedia.org/T171452#3788043 (MoritzMuehlenhoff) Open>Resolved These are fully rolled out: pam debian-archive-keyring
[10:04:02] (CR) Gehel: [C: 1] "LGTM" [software/logstash/plugins] - https://gerrit.wikimedia.org/r/392621 (https://phabricator.wikimedia.org/T178412) (owner: DCausse)
[10:05:40] (CR) Gehel: [C: 2] Upgrade logstash plugins to 5.5.2 [software/logstash/plugins] - https://gerrit.wikimedia.org/r/392621 (https://phabricator.wikimedia.org/T178412) (owner: DCausse)
[10:05:40] (CR) Gehel: [V: 2 C: 2] Upgrade logstash plugins to 5.5.2 [software/logstash/plugins] - https://gerrit.wikimedia.org/r/392621 (https://phabricator.wikimedia.org/T178412) (owner: DCausse)
[10:08:35] !log ppchelko@tin Started deploy [cpjobqueue/deploy@e35aa05]: Revert using keep-alive
[10:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:58] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@e35aa05]: Revert using keep-alive (duration: 00m 
22s)
[10:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:36] (CR) Gehel: [WIP] [logstash] Add a way to move some data to debug_blob (1 comment) [puppet] - https://gerrit.wikimedia.org/r/392591 (https://phabricator.wikimedia.org/T180051) (owner: DCausse)
[10:11:18] (CR) Gehel: [WIP] [logstash] Add a way to move some data to debug_blob (1 comment) [puppet] - https://gerrit.wikimedia.org/r/392591 (https://phabricator.wikimedia.org/T180051) (owner: DCausse)
[10:12:45] (CR) Gehel: [C: 1] [logstash] add debug_blob field [puppet] - https://gerrit.wikimedia.org/r/392590 (https://phabricator.wikimedia.org/T180051) (owner: DCausse)
[10:12:54] Operations, Release Pipeline, Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3788071 (akosiaris) The `DH_GOLANG_EXCLUDES` seems to have worked. I was successfully able to build the package on stretch as well.
[10:21:12] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3788079 (Samat) @Gilles I emailed the file to this address. Thank you in advance!
[10:21:14] (PS4) Jcrespo: mariadb: Leave reimaginable only the db latest servers [puppet] - https://gerrit.wikimedia.org/r/392400 (https://phabricator.wikimedia.org/T170662)
[10:21:30] (PS1) Filippo Giunchedi: role: alert on statsd inbound udp errors [puppet] - https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382)
[10:21:50] (CR) jerkins-bot: [V: -1] role: alert on statsd inbound udp errors [puppet] - https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382) (owner: Filippo Giunchedi)
[10:22:26] (CR) Jcrespo: [C: 2] mariadb: Leave reimaginable only the db latest servers [puppet] - https://gerrit.wikimedia.org/r/392400 (https://phabricator.wikimedia.org/T170662) (owner: Jcrespo)
[10:24:01] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3788102 (Gilles) @Samat sorry I gave you the wrong address, the correct one is: performance-team@lists.wikimedia.org
[10:24:18] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3788103 (Samat) Hmm: I received from Google the following message: "Message not delivered: 550 Address performance-team@wikimedia.org does not exist"
[10:24:42] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3788106 (Gilles) See my reply above
[10:24:58] (PS1) Elukey: role::analytics_cluster::client: move to profiles [puppet] - https://gerrit.wikimedia.org/r/393563 (https://phabricator.wikimedia.org/T167790)
[10:28:26] (CR) Volans: role: alert on statsd inbound udp errors (1 comment) [puppet] - https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382) (owner: Filippo Giunchedi)
[10:33:31] PROBLEM - IPv6 ping to ulsfo on 
ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 22 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:38:31] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:44:50] (PS1) Jcrespo: mariadb: Move db1082 and db1087 to ROW for labsdb filtering [puppet] - https://gerrit.wikimedia.org/r/393569 (https://phabricator.wikimedia.org/T177208)
[10:45:25] (CR) Marostegui: [C: 1] mariadb: Move db1082 and db1087 to ROW for labsdb filtering [puppet] - https://gerrit.wikimedia.org/r/393569 (https://phabricator.wikimedia.org/T177208) (owner: Jcrespo)
[10:45:56] (PS2) Elukey: role::analytics_cluster::client: move to profiles [puppet] - https://gerrit.wikimedia.org/r/393563 (https://phabricator.wikimedia.org/T167790)
[10:46:07] (CR) Jcrespo: "Maybe we can set up all slow hosts later with STATEMENT and prepare them to be masters?" [puppet] - https://gerrit.wikimedia.org/r/393569 (https://phabricator.wikimedia.org/T177208) (owner: Jcrespo)
[10:46:51] (CR) Marostegui: [C: 1] "> Maybe we can set up all slow hosts later with STATEMENT and prepare" [puppet] - https://gerrit.wikimedia.org/r/393569 (https://phabricator.wikimedia.org/T177208) (owner: Jcrespo)
[10:49:08] (PS1) Alexandros Kosiaris: Ignore docker containers in disk space checks [puppet] - https://gerrit.wikimedia.org/r/393570 (https://phabricator.wikimedia.org/T178454)
[10:49:52] (PS2) Jcrespo: mariadb: Move db1082 and db1087 to ROW for labsdb filtering [puppet] - https://gerrit.wikimedia.org/r/393569 (https://phabricator.wikimedia.org/T177208)
[10:51:18] (PS2) Alexandros Kosiaris: Ignore docker containers in disk space checks [puppet] - https://gerrit.wikimedia.org/r/393570 (https://phabricator.wikimedia.org/T178454)
[10:52:10] (CR) Jcrespo: [C: 2] mariadb: Move db1082 and db1087 to ROW for labsdb filtering [puppet] - 
https://gerrit.wikimedia.org/r/393569 (https://phabricator.wikimedia.org/T177208) (owner: Jcrespo)
[10:52:29] (CR) Jcrespo: [C: 2] "db1082 needs depool before restart" [puppet] - https://gerrit.wikimedia.org/r/393569 (https://phabricator.wikimedia.org/T177208) (owner: Jcrespo)
[10:53:02] !log installing postgresql-common security updates
[10:53:04] (CR) Alexandros Kosiaris: [C: 2] Ignore docker containers in disk space checks [puppet] - https://gerrit.wikimedia.org/r/393570 (https://phabricator.wikimedia.org/T178454) (owner: Alexandros Kosiaris)
[10:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:28] (PS3) Alexandros Kosiaris: Ignore docker containers in disk space checks [puppet] - https://gerrit.wikimedia.org/r/393570 (https://phabricator.wikimedia.org/T178454)
[10:54:51] (CR) Elukey: "Still having an issue with stat1005, https://puppet-compiler.wmflabs.org/compiler02/8976/stat1005.eqiad.wmnet/change.stat1005.eqiad.wmnet." [puppet] - https://gerrit.wikimedia.org/r/393563 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[10:57:38] (PS1) Alexandros Kosiaris: Ignore all of /var/lib/docker [puppet] - https://gerrit.wikimedia.org/r/393572 (https://phabricator.wikimedia.org/T178454)
[10:57:52] (PS3) Elukey: role::analytics_cluster::client: move to profiles [puppet] - https://gerrit.wikimedia.org/r/393563 (https://phabricator.wikimedia.org/T167790)
[10:58:57] (CR) Alexandros Kosiaris: [C: 2] Ignore all of /var/lib/docker [puppet] - https://gerrit.wikimedia.org/r/393572 (https://phabricator.wikimedia.org/T178454) (owner: Alexandros Kosiaris)
[11:00:05] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171127T1100).
[11:00:05] No GERRIT patches in the queue for this window AFAICS.
[11:00:54] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393559 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:01:26] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3788187 (10akosiaris) 05Open>03Resolved a:03akosiaris I think I 've solve... [11:02:14] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393559 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:02:29] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393559 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:03:00] (03CR) 10Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/8978/dbstore1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [11:05:07] !log jdrewniak@tin Synchronized portals/prod/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:393559|Bumping portals to master (T128546)]] (duration: 00m 46s) [11:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:13] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:05:53] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:393559|Bumping portals to master (T128546)]] (duration: 00m 45s) [11:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:42] 10Operations, 10Graphite, 10Patch-For-Review, 10Services (watching), 10Wikimedia-Incident: Alert on graphite UDP loss - https://phabricator.wikimedia.org/T181382#3788193 (10mobrovac) [11:17:06] !log bootstrap restbase1007-a - T179422 
[11:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:14] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [11:17:52] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3788200 (10hashar) /var/lib/docker sounds good enough for now and I noticed you... [11:20:37] (03PS1) 10Alexandros Kosiaris: check_eth: Ignore veth interfaces [puppet] - 10https://gerrit.wikimedia.org/r/393575 [11:22:20] (03CR) 10Alexandros Kosiaris: [C: 032] check_eth: Ignore veth interfaces [puppet] - 10https://gerrit.wikimedia.org/r/393575 (owner: 10Alexandros Kosiaris) [11:24:19] (03PS1) 10Addshore: Add FileExporter & FileImporter to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393576 (https://phabricator.wikimedia.org/T181383) [11:24:28] 10Operations, 10Ops-Access-Requests, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3788233 (10hoo) >>! In T179317#3775596, @ArielGlenn wrote: > @hoo: > Which hosts are we looking at then, varnish servers and the app servers? Yes. > Also, what do you... [11:24:59] 10Operations, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga "configured eth" warning when a Docker container is running - https://phabricator.wikimedia.org/T181384#3788234 (10hashar) [11:27:24] !log contint1001 enable Icinga "Disks space" notification again. 
It is no more complaing about Docker partitions | ping mutante | T178454 [11:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:32] T178454: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454 [11:28:24] akosiaris: thank you for the icinga disk space issue with Docker containers :) It works all fine now [11:28:36] hashar: :-) [11:29:09] hashar: btw, I 've went and gargage collected old containers on contint1001 [11:29:33] I think we might want docker run --rm somewhere in the pipeline [11:29:44] instead of "docker run" [11:29:45] akosiaris: and of course there is a configured eth check that fails now. Docker creates a veth6fa987e@if33 when our check_eth is being passed the interface name without @if33 :( [11:30:07] hashar: fixed as well. Look at https://gerrit.wikimedia.org/r/393575 [11:30:10] we do pass --rm ! But icinga happens to run the check while the container is running, it will alarm [11:30:11] ah [11:30:37] hashar: no I mean that finished containers should be delete [11:30:40] deleted* [11:30:44] come on [11:30:45] since I found a few that weren't [11:30:49] maybe they were old ? [11:30:50] 10Operations, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga "configured eth" warning when a Docker container is running - https://phabricator.wikimedia.org/T181384#3788258 (10hashar) a:03akosiaris Fixed by @akosiaris with https://gerrit.wiki... [11:31:02] you fix the bug BEFORE I FINISH FILLING THEM!!!! 
:] [11:31:07] :-D [11:31:48] for left over containers [11:31:53] yeah sometime some get stuck :( [11:32:10] I gotta rewrite the CI jobs to keep track of the docker container id and make sure "docker stop" is always run [11:32:26] + add some garbage collecting cron to reclaim potential left over containers [11:32:49] shower thoughts: we gotta use k8s which probably would handle all of that for us magically [11:34:22] 10Operations, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Icinga "configured eth" warning when a Docker container is running - https://phabricator.wikimedia.org/T181384#3788261 (10hashar) 05Open>03Resolved I ran puppet on contint1001, got a con... [11:34:39] akosiaris: contint1001 icinga checks are all green \o/ [11:35:24] nice [11:35:35] stuck containers ? sigh [11:35:47] yeah those you be "docker stop" at some point in time [11:36:14] but non running containers should not be listed in docker ps -a [11:36:23] that is docker ps and docker ps -a should be identical [11:36:33] and --rm is the way that is solved [11:36:39] and it worked fine for me up to now [11:36:48] so maybe we just had some old pre --rm stuff [11:36:52] and I just noticed [11:37:30] (03PS1) 10Addshore: Enable FileImporter & FileExporter on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393577 (https://phabricator.wikimedia.org/T181383) [11:41:31] * addshore is going for 2 beta only patches [11:41:37] (03CR) 10Addshore: [C: 032] Add FileExporter & FileImporter to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393576 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [11:41:40] (03CR) 10Addshore: [C: 032] Enable FileImporter & FileExporter on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393577 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [11:43:01] (03Merged) 10jenkins-bot: Add FileExporter & FileImporter to extension-list-labs [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/393576 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [11:43:07] (03Merged) 10jenkins-bot: Enable FileImporter & FileExporter on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393577 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [11:43:16] (03CR) 10jenkins-bot: Add FileExporter & FileImporter to extension-list-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393576 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [11:43:18] (03CR) 10jenkins-bot: Enable FileImporter & FileExporter on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393577 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [11:43:47] (03CR) 10Filippo Giunchedi: role: alert on statsd inbound udp errors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382) (owner: 10Filippo Giunchedi) [11:43:49] (03PS2) 10Filippo Giunchedi: role: alert on statsd inbound udp errors [puppet] - 10https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382) [11:44:36] !log addshore@tin Synchronized wmf-config/extension-list-labs: [[gerrit:393576]] BETA ONLY Add FileExporter & FileImporter to extension-list-labs (duration: 00m 45s) [11:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:34] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:393577]] LABS ONLY Enable FileImporter & FileExporter on BETA PT1/2 (T181383) (duration: 00m 45s) [11:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:41] T181383: Deploy FileImporter & FileExporter to beta site - https://phabricator.wikimedia.org/T181383 [11:47:34] !log addshore@tin Synchronized wmf-config/CommonSettings-labs.php: [[gerrit:393577]] LABS ONLY Enable FileImporter & FileExporter on BETA PT2/2 (T181383) (duration: 00m 45s) [11:47:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:28] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/8980/graphite1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382) (owner: 10Filippo Giunchedi) [12:06:42] (03PS1) 10Muehlenhoff: Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 [12:07:02] (03CR) 10jerkins-bot: [V: 04-1] Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 (owner: 10Muehlenhoff) [12:08:13] !log ppchelko@tin Started deploy [cpjobqueue/deploy@47d27dc]: Enable keep-alive T181007 [12:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:20] T181007: Investigate backlog in RecordLintJob - https://phabricator.wikimedia.org/T181007 [12:08:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] user homes: Allow git to control +x for $HOME files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377056 (owner: 10BryanDavis) [12:09:02] (03PS2) 10Muehlenhoff: Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 [12:09:17] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@47d27dc]: Enable keep-alive T181007 (duration: 01m 03s) [12:09:23] (03CR) 10jerkins-bot: [V: 04-1] Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 (owner: 10Muehlenhoff) [12:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:50] (03PS1) 10Addshore: Actually call wfLoadExtension for FileExporter & Importer on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393582 (https://phabricator.wikimedia.org/T181383) [12:13:25] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3788370 (10Gilles) It is indeed 
startling. Does this happen to you on all browsers? You mention that you can trigger this easily, it would be interesting if... [12:14:50] !log Stop replication on db1109 to test table rename for s5/s8 failover [12:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:28] (03CR) 10Addshore: [C: 032] Actually call wfLoadExtension for FileExporter & Importer on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393582 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [12:19:39] (03Merged) 10jenkins-bot: Actually call wfLoadExtension for FileExporter & Importer on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393582 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [12:19:50] (03CR) 10jenkins-bot: Actually call wfLoadExtension for FileExporter & Importer on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393582 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [12:20:25] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382) (owner: 10Filippo Giunchedi) [12:20:27] (03PS1) 10Amire80: Define wmgBabelMainCategory for officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393583 [12:21:30] !log addshore@tin Synchronized wmf-config/CommonSettings-labs.php: [[gerrit:393582]] LABS ONLY Actually call wfLoadExtension for FileExporter & Importer on beta BETA (T181383) (duration: 00m 44s) [12:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:38] T181383: Deploy FileImporter & FileExporter to beta sites - https://phabricator.wikimedia.org/T181383 [12:29:21] PROBLEM - Host mw1276 is DOWN: PING CRITICAL - Packet loss = 100% [12:31:47] looking [12:35:07] !log powercycling mw1276 [12:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:40] (03PS1) 10Addshore: BETA wmgMonologChannels 'FileImporter' => 'debug' 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/393587 (https://phabricator.wikimedia.org/T181383) [12:40:48] (03PS1) 10Marostegui: db-eqiad.php: Warm up s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393588 (https://phabricator.wikimedia.org/T177208) [12:41:04] (03CR) 10Addshore: [C: 032] BETA wmgMonologChannels 'FileImporter' => 'debug' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393587 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [12:41:25] marostegui: ^^ im doing a beta only patch (just incase your about to do yours) :) [12:41:41] addshore: No worries, not planning to deploy yet. Thanks :-) [12:41:47] :) [12:42:30] (03Merged) 10jenkins-bot: BETA wmgMonologChannels 'FileImporter' => 'debug' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393587 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [12:42:43] (03CR) 10jenkins-bot: BETA wmgMonologChannels 'FileImporter' => 'debug' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393587 (https://phabricator.wikimedia.org/T181383) (owner: 10Addshore) [12:43:00] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: mw1276.eqiad.wmnet [12:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:24] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:393587]] BETA wmgMonologChannels FileImporter => debug (duration: 01m 01s) [12:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:33] Got 12:43:54 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/InitialiseSettings-labs.php', 'tin.eqiad.wmnet', 'naos.codfw.wmnet', 'tin.eqiad.wmnet'] on mw1276.eqiad.wmnet returned [255]: ssh: connect to host mw1276.eqiad.wmnet port 22: No route to host [12:44:38] !log started parsoid linter script to generate load on cpjobqueue (python3 parsoid_reparse.py http://parsoid.discovery.wmnet:8000 --sitematrix 
--linter-only --skip-closed https://commons.wikimedia.org/w/api.php) [12:44:41] but i guess that is due to your powercycle moritzm :) [12:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:56] labs only change so no worries :) [12:45:52] addshore: yep, I marked it as inactive (i.e. excluded from dsh group) a few mins ago), has hardware trouble, once resolved "scap pull" will be run before re-pooling [12:46:19] awesome! :) [12:49:54] 10Operations, 10ops-eqiad: Lost network connectivity on mw1276 - https://phabricator.wikimedia.org/T181397#3788499 (10MoritzMuehlenhoff) [12:59:48] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3788542 (10Marostegui) [12:59:51] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Move some masters away from B6 - https://phabricator.wikimedia.org/T169501#3400000 (10Marostegui) db2023 and db2016 aren't masters anymore, so they can be decommissioned. 
We will create a decommission task when ready [13:01:40] !log stop cpjobqueue in eqiad for backlog accumulation [13:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:57] !log updating nginx on francium [13:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:49] (03PS3) 10Elukey: Drop the Eventlogging support for dbstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) [13:22:33] !log remove eventlogging replication support (log database) from dbstore1002 - T156844 [13:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:44] T156844: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844 [13:22:54] (03CR) 10Elukey: [C: 032] Drop the Eventlogging support for dbstore1002 [puppet] - 10https://gerrit.wikimedia.org/r/393238 (https://phabricator.wikimedia.org/T156844) (owner: 10Elukey) [13:31:23] !log installing nspr security updates on trusty [13:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:40] (03PS1) 10Muehlenhoff: Add library hint for nspr [puppet] - 10https://gerrit.wikimedia.org/r/393592 [13:51:31] (03PS2) 10Muehlenhoff: Add library hint for nspr [puppet] - 10https://gerrit.wikimedia.org/r/393592 [13:53:38] (03PS6) 10Jayprakash12345: IP cap lift for Semaine contributive 2017-2018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393442 (https://phabricator.wikimedia.org/T181360) [13:56:27] (03CR) 10Muehlenhoff: [C: 032] Add library hint for nspr [puppet] - 10https://gerrit.wikimedia.org/r/393592 (owner: 10Muehlenhoff) [13:58:40] (03CR) 10Herron: [C: 031] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/393358 (owner: 10Andrew Bogott) [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171127T1400). [14:00:04] Jhs, Zoranzoki21, and Addshore: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] I can SWAT today [14:00:45] Jhs, Zoranzoki21, Addshore: do you want to deploy your changes yourself? [14:01:30] (03CR) 10Jayprakash12345: "Thanks TerraCodes and Zoranzoki21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393442 (https://phabricator.wikimedia.org/T181360) (owner: 10Jayprakash12345) [14:02:39] Jhs, Zoranzoki21, Addshore: around for SWAT? [14:02:54] o/ [14:02:59] just eating a sadwhich [14:03:02] i can do mine [14:03:09] addshore: want to start? [14:03:13] or want to go later? [14:03:13] sandwich.... [14:03:17] can go now! [14:03:30] addshore: go ahead, I will review the rest of the commits then [14:03:35] let me know when you are done [14:03:40] ack! [14:03:47] Jayprakash12345: around for EU SWAT? [14:03:52] zeljkof, hiya. 
mine was apparently done outside of swat it seems, so you can strike it or whatever [14:04:02] Jhs: please do :) [14:04:14] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.230 and port 9042: Connection refused [14:04:25] PROBLEM - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.231 and port 9042: Connection refused [14:04:25] PROBLEM - Check size of conntrack table on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:25] PROBLEM - dhclient process on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:34] PROBLEM - cassandra-c CQL 10.64.0.232:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.232 and port 9042: Connection refused [14:04:34] PROBLEM - Disk space on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:35] PROBLEM - cassandra-c service on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:35] PROBLEM - cassandra-b SSL 10.64.0.231:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:04:39] (03PS2) 10Herron: puppet: point codfw elasticsearch servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392691 (https://phabricator.wikimedia.org/T177254) [14:04:44] PROBLEM - cassandra-c SSL 10.64.0.232:7001 on restbase1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:04:45] PROBLEM - Check systemd state on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:45] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:45] PROBLEM - MD RAID on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:45] PROBLEM - configured eth on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:54] (03CR) 10jerkins-bot: [V: 04-1] puppet: point codfw elasticsearch servers at codfw puppet 4 masters 
[puppet] - 10https://gerrit.wikimedia.org/r/392691 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:04:54] PROBLEM - DPKG on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:04:55] PROBLEM - cassandra-b service on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:05:04] PROBLEM - Check whether ferm is active by checking the default input chain on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:05:09] Jhs: please check if the logo is deployed, if all is ok, remove the commit from the calendar, if there is something left to do, let me know [14:06:48] restbase1007 down, I don't see any maintenance [14:06:58] Zoranzoki21, Jayprakash12345: your commits will not be deployed if you are not around during SWAT [14:07:11] Please give [config] 393442 IP cap lift for Semaine contributive 2017-2018 (T181360) priortiy first [14:07:12] T181360: IP cap lift request for 2017-11-29 - https://phabricator.wikimedia.org/T181360 [14:07:46] I am here. But My net is too slow [14:08:21] : Please give [config] 393442 IP cap lift for Semaine contributive 2017-2018 (T181360) priortiy first [14:08:24] PROBLEM - IPMI Sensor Status on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:08:24] PROBLEM - puppet last run on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:08:30] jynus: I think that Filippo was reimaging it this morning [14:08:47] jayprakash1234: why is 393442 a priority? 
The event is in two days [14:09:08] and cassandra-a was bootstrapping [14:09:55] syncing [14:10:26] IP cap lift is more important than Deploying logo or extension [14:10:34] !log addshore@tin Synchronized php-1.31.0-wmf.8/extensions/AdvancedSearch: SWAT AdvancedSearch T181175 T181222, adjust links in beta section & fix usability of mobile search (duration: 00m 46s) [14:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:43] T181175: AdvancedSearch breaks usability of mobile search by pushing search results down page - https://phabricator.wikimedia.org/T181175 [14:10:43] T181222: AdvancedSearch: adjust links in beta section - https://phabricator.wikimedia.org/T181222 [14:10:50] zeljkof: thats me all done [14:10:57] addshore: ok, taking over [14:11:16] Jayprakash12345: I disagree, especially if the event is in a couple of days [14:12:12] There is nothing in 393442 IP cap lift for Semaine contributive 2017-2018 (T181360) that I can in term of mwdebug1002 [14:12:12] T181360: IP cap lift request for 2017-11-29 - https://phabricator.wikimedia.org/T181360 [14:12:30] Jayprakash12345: ok, I will then merge and deploy, it's next [14:13:33] (03PS3) 10Filippo Giunchedi: role: alert on statsd inbound udp errors [puppet] - 10https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382) [14:13:34] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:14:15] Jayprakash12345: um, the task says "plz, lift it on the frwiki and commons" but the commit does it only for frwiki? 
[14:14:26] (03CR) 10Filippo Giunchedi: [C: 032] role: alert on statsd inbound udp errors [puppet] - 10https://gerrit.wikimedia.org/r/393562 (https://phabricator.wikimedia.org/T181382) (owner: 10Filippo Giunchedi) [14:15:33] zeljkof: Because of wikidata and commonswiki add automacally [14:15:45] Jayprakash12345: ok then, merging, did not know that [14:16:03] ah, I see it in the comments [14:16:03] zeljkof: See https://phabricator.wikimedia.org/T163872 [14:16:36] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393442 (https://phabricator.wikimedia.org/T181360) (owner: 10Jayprakash12345) [14:17:57] (03Merged) 10jenkins-bot: IP cap lift for Semaine contributive 2017-2018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393442 (https://phabricator.wikimedia.org/T181360) (owner: 10Jayprakash12345) [14:18:06] (03CR) 10jenkins-bot: IP cap lift for Semaine contributive 2017-2018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393442 (https://phabricator.wikimedia.org/T181360) (owner: 10Jayprakash12345) [14:19:38] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:393442|IP cap lift for Semaine contributive 2017-2018 (T181360)]] (duration: 00m 45s) [14:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:47] T181360: IP cap lift request for 2017-11-29 - https://phabricator.wikimedia.org/T181360 [14:19:52] (03PS2) 10Giuseppe Lavagetto: [WiP] Move puppet CI to puppet 4.8.2 [puppet] - 10https://gerrit.wikimedia.org/r/393259 [14:19:55] Jayprakash12345: deployed [14:20:04] PROBLEM - Long running screen/tmux on restbase1007 is CRITICAL: Return code of 255 is out of bounds [14:20:11] and thanks for releasing with #releng! 
;) [14:20:14] zeljkof: Thanks [14:20:38] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Move puppet CI to puppet 4.8.2 [puppet] - 10https://gerrit.wikimedia.org/r/393259 (owner: 10Giuseppe Lavagetto) [14:20:53] 10Operations, 10Services, 10Graphite, 10Wikimedia-Incident: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3788736 (10fgiunchedi) Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20171125-statsd [14:21:59] (03CR) 10Zfilipin: "Scheduled for EU SWAT today, but not deployed because Zoranzoki21 was not in #wikimedia-operations." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) (owner: 10Zoranzoki21) [14:22:41] 10Operations, 10Services, 10Graphite, 10Wikimedia-Incident: cpjobqueue spamming statsd metrics - https://phabricator.wikimedia.org/T181333#3786600 (10Pchelolo) The new version of `service-runner` have been published that only sends GC metric once per second - that should prevent this from happening. [14:22:51] !log EU SWAT finished [14:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:48] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3788748 (10Samat) @Gilles 160 secs was an extreme example, but in the 10-60 second range it happens daily. I have several screenshots, but earlier I didn't k... [14:23:50] (03PS3) 10Herron: puppet: point codfw elasticsearch servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392691 (https://phabricator.wikimedia.org/T177254) [14:25:46] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3788752 (10Lea_WMDE) [14:29:23] (03CR) 10Herron: "Ah! Did not realize that! 
Fixed in patch set 3" [puppet] - 10https://gerrit.wikimedia.org/r/392691 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:29:58] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3788764 (10Ottomata) For me, oxygen is not that useful, as we have 90 days of queryable webrequest data in Hadoop. I suppose it is nice to be able to do some quick sed/awk/jq magic on sample... [14:32:57] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3788779 (10Peter) @Samat Just want to check, FF 57 is mentioned in the description but you got this before on older versions too (just want to make sure it i... [14:37:12] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3784670 (10fgiunchedi) The syslog servers have 2x spinning disks and 16GB of memory, so comparable to oxygen performance wise. I personally use the dashboard @elukey mentioned to investigate... [14:38:34] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3786217 (10ema) We've recently started logging requests taking longer than 60 seconds (from varnish's point of view) and sending the logs to logstash. [[http... 
[14:41:57] (03Abandoned) 10Rush: openstack: remove todo for horizon [puppet] - 10https://gerrit.wikimedia.org/r/392861 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:42:54] (03PS1) 10Elukey: Delete role::mariadb::analytics [puppet] - 10https://gerrit.wikimedia.org/r/393597 (https://phabricator.wikimedia.org/T156844) [14:45:07] (03PS2) 10Elukey: Fix prometheus target for the Eventlogging mysql master db [puppet] - 10https://gerrit.wikimedia.org/r/393220 (https://phabricator.wikimedia.org/T177405) [14:47:54] (03CR) 10Gehel: [C: 031] "LGTM (yes, those es* servers are confusing. I can't even remember right now what es stands for in this context...)" [puppet] - 10https://gerrit.wikimedia.org/r/392691 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:48:26] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3788817 (10faidon) I use the sampled-1000 logs from time to time (and the 5xx ones, but less frequently), especially in incident-worthy situations, where speed is of the essence. Additionall... 
[14:49:31] (03PS2) 10Andrew Bogott: puppetmaster.erb: allow switching of puppetmaster_rack_path [puppet] - 10https://gerrit.wikimedia.org/r/393358 [14:49:43] (03PS3) 10Elukey: Fix prometheus target for the Eventlogging mysql master db [puppet] - 10https://gerrit.wikimedia.org/r/393220 (https://phabricator.wikimedia.org/T177405) [14:50:26] (03PS1) 10Rush: openstack: disable notify temporarily [puppet] - 10https://gerrit.wikimedia.org/r/393600 (https://phabricator.wikimedia.org/T171494) [14:50:49] 10Operations, 10ChangeProp, 10Services, 10Graphite: Many kafka topics created by change-prop - https://phabricator.wikimedia.org/T181405#3788822 (10fgiunchedi) [14:54:07] 10Operations, 10ChangeProp, 10Services, 10Graphite: Many kafka topics created by change-prop - https://phabricator.wikimedia.org/T181405#3788840 (10fgiunchedi) [14:54:43] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3784670 (10BBlack) Yeah I still use oxygen pretty routinely. I often prefer being able to construct a CLI pipeline out of jq/grep/sed/sort/uniq/etc... to using a web UI, and usually sampled-... [14:55:57] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3788843 (10Ottomata) > I guess I could extract those stats from stat1006 instead? Are the webrequest logs available there? Naw, and we only have unsampled in Hadoop. [14:57:02] 10Operations, 10Analytics, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264#3788845 (10Ottomata) > Also, the Hadoop data available in the UI doesn't have the same responsiveness in an immediate situation. IIRC it runs up to an hour behind realtime, It actually could... 
[14:58:18] (03CR) 10Andrew Bogott: [C: 032] puppetmaster.erb: allow switching of puppetmaster_rack_path [puppet] - 10https://gerrit.wikimedia.org/r/393358 (owner: 10Andrew Bogott) [14:59:11] (03PS1) 10Herron: puppet: point codfw prometheus servers at codfw puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/393602 (https://phabricator.wikimedia.org/T177254) [15:01:00] (03PS2) 10Herron: puppet: point codfw prometheus servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/393602 (https://phabricator.wikimedia.org/T177254) [15:01:05] 10Operations, 10Ops-Access-Requests, 10Performance-Team (Radar): Varnish and Apache root for hoo - https://phabricator.wikimedia.org/T179317#3788865 (10ArielGlenn) After chat with @hoo in irc, here's the specific list of needs: - strace and tcpdump would be good to have on the mw canaries, which are used f... [15:02:06] (03CR) 10Rush: [C: 032] openstack: disable notify temporarily [puppet] - 10https://gerrit.wikimedia.org/r/393600 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:02:08] (03PS2) 10Rush: openstack: disable notify temporarily [puppet] - 10https://gerrit.wikimedia.org/r/393600 (https://phabricator.wikimedia.org/T171494) [15:05:35] (03CR) 10Filippo Giunchedi: [C: 031] puppet: point codfw prometheus servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/393602 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [15:06:43] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3788870 (10Samat) >>! In T181315#3788779, @Peter wrote: > @Samat Just want to check, FF 57 is mentioned in the description but you got this before on older v... [15:09:22] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) 
timed out before a response was received [15:09:59] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3788877 (10fgiunchedi) >>! In T177196#3783000, @Dzahn wrote: > Is there an issue with the memcached exporter since... [15:10:21] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [15:11:18] !log beginning canary cutover/deployment of codfw elasticsearch servers to codfw puppet 4 puppetmasters [15:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:22] (03CR) 10Herron: [C: 032] puppet: point codfw elasticsearch servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392691 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [15:13:32] (03PS4) 10Herron: puppet: point codfw elasticsearch servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/392691 (https://phabricator.wikimedia.org/T177254) [15:19:44] !log disable puppet across cloud things for cleanup [15:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:01] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3788894 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque... [15:24:21] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3788907 (10Papaul) I am will be receiving the replacement drive tomorrow. [15:30:39] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3558443 (10Mholloway) >>! 
In T174342#3787625, @Tgr wrote: >we have the IP configuration on Zerowiki that's in theory maintained by Maroc employees The IP ranges on ZeroWiki for... [15:33:19] (03PS1) 10Filippo Giunchedi: redis: use hostname not fqdn in redis_exporter [puppet] - 10https://gerrit.wikimedia.org/r/393605 (https://phabricator.wikimedia.org/T148637) [15:33:21] (03PS1) 10Filippo Giunchedi: profile: add redis_exporter to redis multidc [puppet] - 10https://gerrit.wikimedia.org/r/393606 (https://phabricator.wikimedia.org/T148637) [15:33:45] (03CR) 10jerkins-bot: [V: 04-1] profile: add redis_exporter to redis multidc [puppet] - 10https://gerrit.wikimedia.org/r/393606 (https://phabricator.wikimedia.org/T148637) (owner: 10Filippo Giunchedi) [15:35:37] (03PS2) 10Filippo Giunchedi: profile: add redis_exporter to redis multidc [puppet] - 10https://gerrit.wikimedia.org/r/393606 (https://phabricator.wikimedia.org/T148637) [15:45:17] (03PS1) 10Filippo Giunchedi: role: bump statsd inbound udp error threshold [puppet] - 10https://gerrit.wikimedia.org/r/393607 [15:47:31] (03PS1) 10BBlack: lvs400[1-4] to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/393610 (https://phabricator.wikimedia.org/T178535) [15:47:42] (03PS3) 10Herron: puppet: point codfw prometheus servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/393602 (https://phabricator.wikimedia.org/T177254) [15:47:59] 10Operations, 10monitoring: Check for long running screen/tmux should mention usernames - https://phabricator.wikimedia.org/T181409#3788998 (10fgiunchedi) [15:49:30] (03PS3) 10Giuseppe Lavagetto: puppet: Move puppet CI to puppet 4.8.2 [puppet] - 10https://gerrit.wikimedia.org/r/393259 [15:50:07] (03PS9) 10ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - 10https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) [15:52:55] !log beginning canary cutover/deployment of codfw prometheus servers to codfw puppet 4 puppetmasters [15:53:02] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:14] (03CR) 10Filippo Giunchedi: "I've opened https://phabricator.wikimedia.org/T181410 for the actual fix to the check" [puppet] - 10https://gerrit.wikimedia.org/r/393607 (owner: 10Filippo Giunchedi) [15:53:32] (03CR) 10Herron: [C: 032] puppet: point codfw prometheus servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/393602 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [15:54:58] (03PS2) 10BryanDavis: user homes: Allow git to control +x for $HOME files [puppet] - 10https://gerrit.wikimedia.org/r/377056 [15:56:05] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3789047 (10Tobi_WMDE_SW) [15:59:33] (03PS1) 10Elukey: Move journalnode's config from cdh::hadoop to cdh::hadoop:worker [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) [15:59:50] (03CR) 10jerkins-bot: [V: 04-1] Move journalnode's config from cdh::hadoop to cdh::hadoop:worker [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [16:00:02] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T181266#3789060 (10Marostegui) Thanks! 
[16:01:40] (03PS2) 10Elukey: Move journalnode's config from cdh::hadoop to cdh::hadoop:worker [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) [16:03:07] 10Operations, 10ops-codfw, 10Discovery, 10Discovery-Search, 10Elasticsearch: HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3789096 (10Gehel) [16:05:28] (03PS2) 10BBlack: lvs400[1-4] to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/393610 (https://phabricator.wikimedia.org/T178535) [16:07:57] !log lvs400x - puppet disabled for https://gerrit.wikimedia.org/r/#/c/393610 [16:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:11] (03CR) 10BBlack: [C: 032] lvs400[1-4] to spare::system [puppet] - 10https://gerrit.wikimedia.org/r/393610 (https://phabricator.wikimedia.org/T178535) (owner: 10BBlack) [16:08:31] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/8987/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [16:10:35] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:10:41] (03CR) 10Elukey: [C: 04-1] "Nope, doesn't work on an1028: https://puppet-compiler.wmflabs.org/compiler02/8988/analytics1028.eqiad.wmnet/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [16:11:31] PROBLEM - Host text-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:11:33] lvs1001 is not me, probably a problem! [16:11:36] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:11:36] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [16:11:53] not me intentionally, anyways, reverting [16:11:56] <_joe_> bblack: indeed [16:12:06] (03PS1) 10BBlack: Revert "lvs400[1-4] to spare::system" [puppet] - 10https://gerrit.wikimedia.org/r/393614 [16:12:15] <_joe_> bblack: is that all? 
[16:12:15] (03CR) 10BBlack: [V: 032 C: 032] Revert "lvs400[1-4] to spare::system" [puppet] - 10https://gerrit.wikimedia.org/r/393614 (owner: 10BBlack) [16:12:19] need help? [16:12:29] in a meeting with a bunch of opsens, can drop if needed [16:12:30] not yet [16:12:38] both lvs1001 and lvs3001 [16:12:39] ? [16:13:02] yeah I don't know yet either, but reverting the recent lvs commit [16:13:15] I can login at both hosts [16:13:17] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 83.80 ms [16:13:28] <_joe_> I can access the wikis too [16:13:48] I am checking starts, but not 100% sure yet [16:13:57] but so far see no problem [16:14:17] ping lvs1001 [16:14:17] PING lvs1001.wikimedia.org (208.80.154.55) 56(84) bytes of data. [16:14:17] ^C [16:14:17] --- lvs1001.wikimedia.org ping statistics --- [16:14:17] 4 packets transmitted, 0 received, 100% packet loss, time 3043ms [16:14:22] from einsteinium... something's wrong [16:14:49] I see no traffic changes at any layer [16:14:49] and bast1001 so it's something at the box, not at the monitoring level [16:15:13] same holds for lvs3001 [16:15:19] puppet didn't even run on lvs1001 after my merge [16:15:25] something else odd is going on here [16:15:28] yes [16:15:40] if it is spare, could it be just a fake alarm? [16:15:44] it's not a spare [16:15:46] no it's not fake [16:15:48] ah [16:15:56] ipvs stats look ok though [16:16:10] <_joe_> yeah I was about to say the same [16:16:20] this looks icmp related [16:16:25] maybe firewall? 
[16:16:35] and by that I mean the raw ones, ipvsadm -Ln on the lvs[13]001 hosts [16:16:38] look fine [16:16:43] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 83.89 ms [16:17:01] <_joe_> jynus: those boxes have no firewall [16:17:07] <_joe_> do NOT run iptables there [16:17:08] please don't check it with iptables [16:17:19] root@lvs1001:~# lsmod|grep tab [16:17:19] iptable_filter 16384 0 [16:17:19] ip_tables 24576 1 iptable_filter [16:17:19] x_tables 36864 2 ip_tables,iptable_filter [16:17:23] someone did :P [16:17:24] that's me [16:17:29] I have not accessed it [16:17:35] I shouldn't be doing many things at once [16:17:41] I'll remove the modules [16:17:43] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:17:47] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [16:17:49] thankfully conntrack didn't kill everything [16:17:51] (03PS1) 10Rush: Revert "openstack: disable notify temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/393615 [16:17:58] (03PS2) 10Rush: Revert "openstack: disable notify temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/393615 [16:18:12] now esams? [16:18:17] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 83.80 ms [16:18:19] it was already esams, it's just flapping [16:18:39] ok ip_tables, iptable_filter and x_tables modules removed [16:18:49] I am glad this did not kill lvs1001 [16:19:11] maybe we should literally ban those if they are bad [16:19:16] that's one of our assumptions we should re-evaluate at some point in time [16:19:17] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:19] jynus: I have patches outstanding for that [16:19:22] ok [16:20:19] there you go https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=14&fullscreen&orgId=1&var-server=lvs1001&var-datasource=eqiad%20prometheus%2Fops [16:20:55] why so many out icmps? 
[16:21:27] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 86%, RTA = 85.05 ms [16:21:53] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 83.90 ms [16:21:59] so, to recap a little: we appear to be having an ICMP problem, not an HTTPS service problem. There's a ton of icmp traffic, it's ratelimiting, and thus icinga ping checks are dying [16:22:04] (03CR) 10Rush: [C: 032] Revert "openstack: disable notify temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/393615 (owner: 10Rush) [16:22:31] bblack: yes agreed [16:23:17] RECOVERY - Host lvs1001 is UP: PING WARNING - Packet loss = 93%, RTA = 0.20 ms [16:25:58] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:58] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 83.73 ms [16:27:07] PROBLEM - Host lvs1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:38] (03CR) 10Tjones: [C: 031] Add stemming languages settings for description indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392894 (https://phabricator.wikimedia.org/T176903) (owner: 10Smalyshev) [16:32:47] (03PS3) 10Elukey: Move journalnode's config from cdh::hadoop to cdh::hadoop:worker [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) [16:36:45] (03CR) 10Elukey: "All right now it looks better: https://puppet-compiler.wmflabs.org/compiler02/8990/analytics1028.eqiad.wmnet/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [16:39:03] (03PS3) 10Dzahn: Gerrit: git community logo [puppet] - 10https://gerrit.wikimedia.org/r/392181 (owner: 10Thcipriani) [16:40:09] (03CR) 10Dzahn: [C: 032] Gerrit: git community logo [puppet] - 10https://gerrit.wikimedia.org/r/392181 (owner: 10Thcipriani) [16:42:47] (03CR) 10Dzahn: "i see https://phabricator.wikimedia.org/T178454 got resolved and seems related" [puppet] - 
10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) (owner: 10Hashar) [16:43:40] (03Abandoned) 10Marostegui: db-eqiad.php: Warm up s8 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393588 (https://phabricator.wikimedia.org/T177208) (owner: 10Marostegui) [16:44:24] 10Operations, 10Release Pipeline, 10monitoring, 10Continuous-Integration-Infrastructure (shipyard), and 2 others: Icinga disk space alert when a Docker container is running on an host - https://phabricator.wikimedia.org/T178454#3789255 (10Dzahn) Is https://gerrit.wikimedia.org/r/#/c/393215/ also this ticke... [16:52:02] mutante: thanks for the merge! :) [16:52:23] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:52:32] RECOVERY - Host text-lb.eqiad.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 0.31 ms [16:53:02] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 83.79 ms [16:53:55] thcipriani: yw, thanks too [16:54:03] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:54:03] (03CR) 10Nikerabbit: [C: 031] Define wmgBabelMainCategory for officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393583 (owner: 10Amire80) [16:54:32] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 83.79 ms [16:57:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] diamond: skip DiskSpace for Docker containers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/393215 (https://phabricator.wikimedia.org/T177052) (owner: 10Hashar) [16:57:54] [we're still working on the above, but in the overall, services to users shouldn't be seeing much impact (if any)] [16:57:57] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:59:01] (03CR) 10Zoranzoki21: "> Scheduled for EU SWAT today, but not deployed because Zoranzoki21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 
(https://phabricator.wikimedia.org/T179241) (owner: 10Zoranzoki21) [16:59:12] (03PS20) 10Zoranzoki21: Enable the ArticlePlaceholder for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [17:02:23] 10Operations, 10MinervaNeue, 10MobileFrontend, 10Readers-Web-Backlog, 10Wikimedia-log-errors: Special:Nearby broken on wmflabs (503 server error) - https://phabricator.wikimedia.org/T181290#3789317 (10Jdlrobson) [17:02:28] Hi. Please deploy patch https://gerrit.wikimedia.org/r/#/c/387077/ [17:02:38] I not been in time here when is scheduled [17:02:42] Can you do it now? [17:02:43] I am here [17:02:44] (03CR) 10TerraCodes: "> > Scheduled for EU SWAT today, but not deployed because Zoranzoki21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) (owner: 10Zoranzoki21) [17:02:56] 10Operations, 10MinervaNeue, 10MobileFrontend, 10Readers-Web-Backlog, 10Wikimedia-log-errors: Special:Nearby broken on wmflabs (503/500 server error) - https://phabricator.wikimedia.org/T181290#3785279 (10Jdlrobson) [17:03:35] (03CR) 10Zoranzoki21: "> > > Scheduled for EU SWAT today, but not deployed because" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) (owner: 10Zoranzoki21) [17:06:06] Zoranzoki21: Probably not, deployments out of SWAT windows usually aren't done unless it's really urgent (which this patch is not). Schedule it for the another SWAT window, at a time when you can be here. 
[17:06:42] ok [17:10:19] I added for next swat [17:16:47] 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, 10User-Elukey: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3789390 (10Milimetric) [17:24:07] 10Operations, 10ops-codfw, 10Discovery, 10Discovery-Search, 10Elasticsearch: HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3789424 (10Papaul) @Gehel I will be receiving the part tomorrow Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise f... [17:25:00] 10Operations, 10ops-codfw, 10Discovery, 10Discovery-Search, 10Elasticsearch: HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3789425 (10Gehel) That's fast! Thanks! Ping me if you need me to be around when you install it. [17:25:22] 10Operations, 10ops-codfw, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3789427 (10Gehel) [17:27:05] 10Operations, 10ops-codfw, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): HP RAID Battery issue on elastic2004 - https://phabricator.wikimedia.org/T181412#3789434 (10Papaul) I will because i have to update all the firmware too on the server. So will let you know tomorrow once I received... 
[17:39:34] (03CR) 10EBernhardson: [C: 031] [logstash] log all elastic queries [puppet] - 10https://gerrit.wikimedia.org/r/392603 (https://phabricator.wikimedia.org/T180051) (owner: 10DCausse) [17:45:41] (03CR) 10EBernhardson: [C: 031] [logstash] add debug_blob field [puppet] - 10https://gerrit.wikimedia.org/r/392590 (https://phabricator.wikimedia.org/T180051) (owner: 10DCausse) [17:49:02] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3789480 (10Gehel) It looks liek we've done most of our analysis already. Shipping logs worked. We can always reopen... [17:50:31] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3789484 (10RobH) 05Open>03Resolved [17:53:13] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006 - https://phabricator.wikimedia.org/T181121#3789501 (10akosiaris) bohrium has been migrated to the other 2 boxes as well [18:00:04] gehel: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171127T1800). [18:00:04] Smalyshev: A patch you scheduled for Wikidata Query Service weekly deploy is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:02:52] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Passenger spews Exception NoMethodError in Rack application object - https://phabricator.wikimedia.org/T180944#3789583 (10bd808) [18:04:33] (03CR) 10Dzahn: [C: 04-1] "Apache on cobalt is too old for this. 
at the minimum this should only "if stretch" but more likely even "if buster", per the security-rela" [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) (owner: 10Paladox) [18:05:17] !log Sending Toolforge survey 1 week reminder emails from silver for T177126 [18:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:25] T177126: 2017 Toolforge user survey - https://phabricator.wikimedia.org/T177126 [18:05:45] (03PS8) 10Dzahn: openldap: move firewall/standard to roles, use profile [puppet] - 10https://gerrit.wikimedia.org/r/391737 [18:06:02] !log poweroff ganeti1005, ganeti1006 T181121 [18:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:09] T181121: Possible memory errors on ganeti1005, ganeti1006 - https://phabricator.wikimedia.org/T181121 [18:06:33] (03CR) 10Dzahn: [C: 032] "seaborgium = labs LDAP , dubnium = corp LDAP, no-op http://puppet-compiler.wmflabs.org/8937/ and violation delta -8" [puppet] - 10https://gerrit.wikimedia.org/r/391737 (owner: 10Dzahn) [18:08:26] !log Killed truthy nt dumpers stuck at 100% CPU (from last week) on snapshot1007 (T181385) [18:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:33] T181385: Wikidata truthy nt dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385 [18:08:57] (03CR) 10Dzahn: "no-op confirmed on seaborgium and dubnium" [puppet] - 10https://gerrit.wikimedia.org/r/391737 (owner: 10Dzahn) [18:10:33] !log gehel@tin Started deploy [wdqs/wdqs@7ad20f3]: (no justification provided) [18:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:57] damn forgot the justification again... 
[18:11:07] !log deploying blazegraph + GUI updates [18:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:44] !log gehel@tin Finished deploy [wdqs/wdqs@7ad20f3]: (no justification provided) (duration: 02m 10s) [18:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:43] (03PS4) 10Elukey: Remove journalnode's config auto-inclusion from cdh::hadoop [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) [18:13:49] SMalyshev: deployment completed, tests are green. Feel free to have a look! [18:14:33] !log pushing v6 gre firewall filter [18:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:09] (03PS5) 10Elukey: Remove journalnode's config auto-inclusion from cdh::hadoop [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) [18:21:17] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received [18:22:08] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [18:22:42] 10Operations, 10CheckUser, 10Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368#3789662 (10Krenair) Per @legoktm this has nothing to do with #Analytics as far as I am aware. >>! In T181368#3787869, @ema wrote: >> I would not... 
[18:23:49] (03PS1) 10BBlack: Revert "Revert "lvs400[1-4] to spare::system"" [puppet] - 10https://gerrit.wikimedia.org/r/393623 [18:23:57] (03PS2) 10BBlack: Revert "Revert "lvs400[1-4] to spare::system"" [puppet] - 10https://gerrit.wikimedia.org/r/393623 [18:26:51] (03PS3) 10Dzahn: memcached: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382922 (https://phabricator.wikimedia.org/T177225) [18:27:50] (03CR) 10BBlack: [C: 032] Revert "Revert "lvs400[1-4] to spare::system"" [puppet] - 10https://gerrit.wikimedia.org/r/393623 (owner: 10BBlack) [18:28:12] (03CR) 10Dzahn: [C: 031] "see https://phabricator.wikimedia.org/T177196#3783000 the right dashboard is https://grafana.wikimedia.org/dashboard/db/prometheus-memcach" [puppet] - 10https://gerrit.wikimedia.org/r/382922 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:30:36] (03PS4) 10Dzahn: memcached: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382922 (https://phabricator.wikimedia.org/T177225) [18:31:48] PROBLEM - pybal on lvs4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [18:32:25] lvs400x alerts are ok (sorry, didn't silence ahead!) 
[18:32:40] should be gone now, I updated einsteinium [18:35:09] (03CR) 10Ottomata: [C: 031] Remove journalnode's config auto-inclusion from cdh::hadoop [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [18:35:20] (03PS2) 10BBlack: kmod::blacklist: prevent manual install, update initramfs [puppet] - 10https://gerrit.wikimedia.org/r/392644 [18:35:29] (03CR) 10Dzahn: [C: 032] memcached: remove ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/382922 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [18:36:11] (03PS3) 10BBlack: kmod::blacklist: prevent manual install, update initramfs [puppet] - 10https://gerrit.wikimedia.org/r/392644 [18:36:42] (03CR) 10BBlack: [C: 032] kmod::blacklist: prevent manual install, update initramfs [puppet] - 10https://gerrit.wikimedia.org/r/392644 (owner: 10BBlack) [18:37:00] removing/decom ganglia from memcached servers, it _might_ cause some temp. puppet alert due to races, but maybe not [18:37:37] (03PS2) 10BBlack: cp/lvs: prevent accidental iptables kmods [puppet] - 10https://gerrit.wikimedia.org/r/392645 [18:38:54] sees the kmod::blacklist change being applied (but not the expected ganglia removal because that already happened in the past) [18:39:23] (03CR) 10BBlack: [C: 032] cp/lvs: prevent accidental iptables kmods [puppet] - 10https://gerrit.wikimedia.org/r/392645 (owner: 10BBlack) [18:40:46] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:41:15] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:41:15] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:41:40] possibly me [18:42:04] checked one, and yea, it is [18:42:06] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:42:08] invalid relationship [18:42:15] worked fine on cp/lvs? [18:42:31] Invalid relationship: File[/etc/modprobe.d/blacklist-wmf.conf] { notify => Exec[update-initramfs] }, because Exec[update-initramfs] doesn't seem to be in the catalog [18:42:38] Exec[update-initramfs] doesn't seem to be in the catalog [18:42:44] I thought that was in base for all? [18:42:53] i bet it's in standard [18:42:58] and these dont include it [18:43:16] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:43:25] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:43:25] PROBLEM - puppet last run on labvirt1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:43:39] ah [18:43:41] if os_version('debian >= jessie') { [18:43:41] class { '::base::initramfs': } [18:43:41] } [18:43:48] it's in base not standard, but only for jessie :P [18:43:53] ah.. [18:44:00] (03CR) 10Zoranzoki21: [C: 031] Remove $wgStyleVersion appending in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 (https://phabricator.wikimedia.org/T181318) (owner: 10Reedy) [18:44:05] so, all the trusties will spam IRC now [18:44:16] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:44:21] nice reminder which ones need upgrade :) [18:44:26] PROBLEM - puppet last run on labtestservices2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:44:33] working on a fixup patch [18:44:41] we have 90 of them :D [18:44:46] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:45:06] bblack: do you want me to stop ircecho for a bit? [18:45:15] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:45:15] PROBLEM - puppet last run on labcontrol1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:45:15] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:45:15] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:45:16] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:45:20] Oh it spamming much [18:45:39] volans: couldn't hurt? [18:45:50] !log stopped ircecho temporary to avoid spam [18:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:30] !log test [18:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:30] the ~> chaining arrows don't require resources to exist, right? [18:48:04] bblack: Applies the resource on the left first. If the left-hand resource changes, the right-hand resource will refresh [18:48:11] (03PS1) 10BBlack: kmod::blacklist - allow undefined update-initramfs [puppet] - 10https://gerrit.wikimedia.org/r/393626 [18:48:29] yeah I just thought they allowed non-existence, as compared to notify=> [18:48:42] but maybe that's only the <| |> form? 
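For reference, the difference being asked about can be sketched roughly as follows. This is a minimal illustration, not the patch under review: the two resource titles come from the error message quoted above, while the file content is a hypothetical placeholder. As far as the Puppet language goes, a direct resource reference in a relationship has to resolve to a declared resource, whereas a collector may match nothing.

```puppet
# Sketch only: the content below is a placeholder, not the real blacklist.
file { '/etc/modprobe.d/blacklist-wmf.conf':
  ensure  => file,
  content => "blacklist x_tables\n",
}

# Direct reference: the catalog must contain Exec['update-initramfs'],
# otherwise compilation fails with a "Could not find resource" style error.
# File['/etc/modprobe.d/blacklist-wmf.conf'] ~> Exec['update-initramfs']

# Collector form: matches zero or more Exec resources with that title,
# so the notify relationship is only created where the exec is declared.
File['/etc/modprobe.d/blacklist-wmf.conf'] ~> Exec <| title == 'update-initramfs' |>
```

One caveat with the collector form: it also realizes any matching virtual resources, so it is worth checking with the puppet compiler before relying on it.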
[18:48:50] not sure [18:49:04] let's puppet compile it ;) [18:49:20] yeah doing [18:49:53] (03PS2) 10Smalyshev: Create script for automatic reload of categories [puppet] - 10https://gerrit.wikimedia.org/r/392736 (https://phabricator.wikimedia.org/T173772) [18:50:02] !log elastic/cirrus: reindexing english group2 wikis: T179945 [18:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:11] T179945: Re-index English-language wikis to pick up kana mapping - https://phabricator.wikimedia.org/T179945 [18:50:16] bblack: no, still complaining [18:50:23] bleh [18:50:24] Could not find resource 'Exec[update-initramfs]' [18:50:38] revert then, I have other stuff on my schedule to figure this out at the moment [18:50:45] missed notification: [18:50:46] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:00] (03PS1) 10BBlack: Revert "cp/lvs: prevent accidental iptables kmods" [puppet] - 10https://gerrit.wikimedia.org/r/393628 [18:51:11] (03CR) 10BBlack: [V: 032 C: 032] Revert "cp/lvs: prevent accidental iptables kmods" [puppet] - 10https://gerrit.wikimedia.org/r/393628 (owner: 10BBlack) [18:51:14] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) 
timed out before a response was received [18:51:19] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 929 bytes in 0.065 second response time [18:51:28] (03PS1) 10BBlack: Revert "kmod::blacklist: prevent manual install, update initramfs" [puppet] - 10https://gerrit.wikimedia.org/r/393629 [18:51:31] * volans is the new icinga-wm :D [18:51:37] (03PS2) 10BBlack: Revert "kmod::blacklist: prevent manual install, update initramfs" [puppet] - 10https://gerrit.wikimedia.org/r/393629 [18:51:41] (03CR) 10BBlack: [V: 032 C: 032] Revert "kmod::blacklist: prevent manual install, update initramfs" [puppet] - 10https://gerrit.wikimedia.org/r/393629 (owner: 10BBlack) [18:52:05] bblack: can you also run https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed ? [18:52:14] I have to step away in few minutes [18:52:22] (03PS3) 10BBlack: Revert "kmod::blacklist: prevent manual install, update initramfs" [puppet] - 10https://gerrit.wikimedia.org/r/393629 [18:52:28] (03CR) 10BBlack: [V: 032 C: 032] Revert "kmod::blacklist: prevent manual install, update initramfs" [puppet] - 10https://gerrit.wikimedia.org/r/393629 (owner: 10BBlack) [18:52:52] volans: yes [18:52:58] thx [18:53:21] running now [18:55:23] starting to get recoveries [18:58:45] should be all recovered now [18:58:54] re-running --failed-only in case of races [18:59:18] if they didn't failed there shouldn't be IIRC [18:59:29] waiting icinga to catchup and restarting ircecho [18:59:53] well I mean, they may have started an agent run on the failed catalog just as --failed-only was going out, but not completed it yet [19:00:00] unless --failed-only waits on current runs? [19:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 8 patches) deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171127T1900). [19:00:05] RoanKattouw, Zoranzoki21, and Smalyshev: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:05] yes it waits [19:00:09] ah, cool [19:00:11] or bailout if timeout [19:00:12] here I am [19:00:28] Who will do swat? I am here now :) [19:01:17] I can SWAT [19:01:47] OK [19:02:00] First deploy my [19:02:34] !log restarted ircecho [19:02:40] (03CR) 10Catrope: [C: 032] Enable the ArticlePlaceholder for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) (owner: 10Zoranzoki21) [19:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:20] Thank you very much [19:04:31] Zoranzoki21: Do you have the WikimediaDebug extension installed in your browser (Chrome or Firefox)? [19:04:45] RoanKattouw: No [19:04:52] (03Merged) 10jenkins-bot: Enable the ArticlePlaceholder for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) (owner: 10Zoranzoki21) [19:04:57] RoanKattouw: Why? [19:04:59] Please install it. You'll need it in a minute to test your change before it goes liev [19:05:01] *live [19:05:21] We first deploy changes to a test server, then ask the requester to test the change there [19:05:32] The browser extension is needed to use the test server [19:05:58] RoanKattouw: How to install? 
[19:06:13] If you have Chrome, https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb [19:06:29] If you have Firefox, https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/ [19:06:40] I have maxton [19:07:10] (03CR) 10jenkins-bot: Enable the ArticlePlaceholder for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) (owner: 10Zoranzoki21) [19:07:25] OK [19:07:51] I can probably test this change for you, because I'm pretty sure I know how to test this one [19:08:08] Ok, please test [19:08:10] But for any future deployments it'd be helpful if you had either FF or Chrome installed with that extension [19:08:13] Ok [19:08:31] RoanKattouw: Thank you [19:10:15] (03CR) 10Elukey: [C: 032] Remove journalnode's config auto-inclusion from cdh::hadoop [puppet/cdh] - 10https://gerrit.wikimedia.org/r/393611 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [19:11:47] Is enabled ircecho? [19:11:59] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: T179241 (duration: 00m 45s) [19:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:06] T179241: Enable the ArticlePlaceholder for Northern Sami (sewiki) - https://phabricator.wikimedia.org/T179241 [19:13:19] (03CR) 10Catrope: [C: 032] Add stemming languages settings for description indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392894 (https://phabricator.wikimedia.org/T176903) (owner: 10Smalyshev) [19:13:35] SMalyshev: I'm guessing this change ---^^ is indexing-only and doesn't need/support mwdebug1002 testing? 
[19:13:40] (03PS14) 10Madhuvishy: labstore: initial ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [19:13:50] RoanKattouw: you are correct, it's for indexing only [19:14:19] so no point in testing on mwdebug, it doesn't do anything yet [19:14:31] (03CR) 10Madhuvishy: [C: 032] labstore: initial ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136) (owner: 10Muehlenhoff) [19:14:44] Thank you very much for deploying. Bye [19:14:58] (03PS1) 10Elukey: profile::hadoop::worker: include the journalnode when needed [puppet] - 10https://gerrit.wikimedia.org/r/393636 (https://phabricator.wikimedia.org/T167790) [19:17:27] (03Merged) 10jenkins-bot: Add stemming languages settings for description indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392894 (https://phabricator.wikimedia.org/T176903) (owner: 10Smalyshev) [19:17:38] (03CR) 10jenkins-bot: Add stemming languages settings for description indexing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392894 (https://phabricator.wikimedia.org/T176903) (owner: 10Smalyshev) [19:17:55] !log Removed more bogus md5 wanobjecache/ metric from graphite[12]001 [19:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:28] (03CR) 10Elukey: "it seems a wonderful no-op: https://puppet-compiler.wmflabs.org/compiler02/8995/" [puppet] - 10https://gerrit.wikimedia.org/r/393636 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [19:19:28] (03CR) 10Ottomata: [C: 031] profile::hadoop::worker: include the journalnode when needed [puppet] - 10https://gerrit.wikimedia.org/r/393636 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [19:20:38] (03PS2) 10Elukey: profile::hadoop::worker: include the journalnode when needed [puppet] - 10https://gerrit.wikimedia.org/r/393636 
(https://phabricator.wikimedia.org/T167790) [19:21:08] (03CR) 10Elukey: [C: 032] profile::hadoop::worker: include the journalnode when needed [puppet] - 10https://gerrit.wikimedia.org/r/393636 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [19:21:09] !log catrope@tin Synchronized wmf-config/Wikibase.php: T176903 (duration: 00m 45s) [19:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:16] T176903: Index wikidata descriptions - https://phabricator.wikimedia.org/T176903 [19:21:26] RoanKattouw: thanks! [19:28:07] (03CR) 10Chad: "Nope, I just did it when I had some spare time :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393075 (https://phabricator.wikimedia.org/T181241) (owner: 10Jon Harald Søby) [19:31:49] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3789909 (10Fjalapeno) Hey all, following up here after the holiday. Looks like we have next steps from @Joe. I'll add this to the... [19:32:19] (03CR) 10Chad: "We should abandon for now. Cobalt needs upgrading (which I'm not rushing to do), needs further testing in a live environment, etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) (owner: 10Paladox) [19:32:36] (03Abandoned) 10Paladox: Gerrit: Enable http/2 for apache [puppet] - 10https://gerrit.wikimedia.org/r/392489 (https://phabricator.wikimedia.org/T180978) (owner: 10Paladox) [19:33:57] 10Operations, 10Goal, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197#3789921 (10elukey) [19:34:07] !log otto@tin Started deploy [eventlogging/eventbus@3df06ab]: Temp deploy to kafka1001 only to catch bug: T180017 [19:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:13] T180017: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 [19:34:19] !log otto@tin Finished deploy [eventlogging/eventbus@3df06ab]: Temp deploy to kafka1001 only to catch bug: T180017 (duration: 00m 12s) [19:34:26] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197#3650154 (10elukey) [19:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:23] (03Abandoned) 10BBlack: kmod::blacklist - allow undefined update-initramfs [puppet] - 10https://gerrit.wikimedia.org/r/393626 (owner: 10BBlack) [19:35:36] (03PS2) 10Chad: Define wmgBabelMainCategory for officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393583 (owner: 10Amire80) [19:38:05] (03CR) 10Chad: [C: 032] Define wmgBabelMainCategory for officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393583 (owner: 10Amire80) [19:39:02] !log catrope@tin Synchronized php-1.31.0-wmf.8/includes/specials/SpecialRecentchangeslinked.php: T181100 (duration: 00m 45s) [19:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:09] T181100: [wmf.8 - regression] Related changes: Saved filters set as default won't 
include the page as part of a query - https://phabricator.wikimedia.org/T181100 [19:39:19] (03Merged) 10jenkins-bot: Define wmgBabelMainCategory for officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393583 (owner: 10Amire80) [19:39:29] (03CR) 10jenkins-bot: Define wmgBabelMainCategory for officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393583 (owner: 10Amire80) [19:40:25] (03PS1) 10BBlack: kmod::blacklist: prevent manual install, update initramfs [puppet] - 10https://gerrit.wikimedia.org/r/393639 [19:40:27] (03PS1) 10BBlack: cp/lvs: prevent accidental iptables kmods [puppet] - 10https://gerrit.wikimedia.org/r/393640 [19:41:32] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: babel officewiki thing (duration: 00m 45s) [19:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:42] !log catrope@tin Synchronized php-1.31.0-wmf.8/resources/src/mediawiki.rcfilters/mw.rcfilters.UriProcessor.js: T181100 (duration: 00m 45s) [19:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:43] (03CR) 10BBlack: [C: 032] kmod::blacklist: prevent manual install, update initramfs [puppet] - 10https://gerrit.wikimedia.org/r/393639 (owner: 10BBlack) [19:44:46] (03CR) 10BBlack: [C: 032] cp/lvs: prevent accidental iptables kmods [puppet] - 10https://gerrit.wikimedia.org/r/393640 (owner: 10BBlack) [19:47:55] herron: am I correct in understanding that in the final puppet 4 state neither client nor master should have 'environment' set in puppet.conf? [19:48:45] I think it's having parser set that causes a warning [19:49:23] RoanKattouw: Can I sneak another config change into the SWAT window, or should I wait until this afternoon? 
[19:50:01] kaldari: Go for it [19:50:10] cool [19:50:17] (03CR) 10Kaldari: [C: 032] Enable MP3 uploading on Commons on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392765 (https://phabricator.wikimedia.org/T120288) (owner: 10Kaldari) [19:50:20] (03CR) 10Chad: "Hahaha, what on earth was this doing around still?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 (https://phabricator.wikimedia.org/T181318) (owner: 10Reedy) [19:50:34] herron: if I have environment=future set on a client but not on the master, it doesn't work at all. Not Found: Could not find environment 'future'" [19:51:18] So I'm not clear how to gracefully transition now that I have 'future' set on all my clients. I have to unset it after the master is upgraded, but… once the master is upgraded I can't actually change anything on the client via puppet because puppet breaks [19:51:41] (03PS4) 10Kaldari: Enable MP3 uploading on Commons on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392765 (https://phabricator.wikimedia.org/T120288) [19:51:45] (03CR) 10Reedy: "Cause it's still in core too :(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 (https://phabricator.wikimedia.org/T181318) (owner: 10Reedy) [19:53:59] (03CR) 10Kaldari: Enable MP3 uploading on Commons on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392765 (https://phabricator.wikimedia.org/T120288) (owner: 10Kaldari) [19:54:03] (03CR) 10Kaldari: [C: 032] Enable MP3 uploading on Commons on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392765 (https://phabricator.wikimedia.org/T120288) (owner: 10Kaldari) [19:54:22] if I'm understanding this right it sounds like the thing to do is create an environment called future on the master [19:54:52] can you show me which config? 
[19:54:58] !log crN-ulsfo: remove lvs400[1-4] from PyBal BGP neighbors list [19:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:23] (03Merged) 10jenkins-bot: Enable MP3 uploading on Commons on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392765 (https://phabricator.wikimedia.org/T120288) (owner: 10Kaldari) [19:55:27] herron: you can see what's happening on upgrade-master.puppet.eqiad.wmflabs [19:55:33] and upgrade-client.puppet.eqiad.wmflabs [19:55:39] ok, checking those out [19:55:52] The client is currently in the broken state, if you remove the environment line in puppet.conf it will work [19:56:08] does the 'future' environment still work on e.g. labpuppetmaster2001? Maybe this is just another accidental puppet diff [19:57:15] (03CR) 10jenkins-bot: Enable MP3 uploading on Commons on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392765 (https://phabricator.wikimedia.org/T120288) (owner: 10Kaldari) [19:57:49] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3790042 (10Fjalapeno) [20:01:45] !log kaldari@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 46s) [20:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:33] !log kaldari@tin Synchronized wmf-config/CommonSettings.php: (no justification provided) (duration: 00m 45s) [20:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:43] !log kaldari@tin Synchronized wmf-config/InitialiseSettings-labs.php: (no justification provided) (duration: 00m 45s) [20:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:48] (03PS2) 10Chad: Remove $wgStyleVersion appending in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 
(https://phabricator.wikimedia.org/T181318) (owner: 10Reedy) [20:07:53] (03CR) 10Chad: [C: 032] Remove $wgStyleVersion appending in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 (https://phabricator.wikimedia.org/T181318) (owner: 10Reedy) [20:14:27] (03Merged) 10jenkins-bot: Remove $wgStyleVersion appending in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 (https://phabricator.wikimedia.org/T181318) (owner: 10Reedy) [20:16:59] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3790158 (10Keegan) >>! In T174342#3787625, @Tgr wrote: > So what does it take for this task to be resolved? Is someone actually looking into it or is it just being pushed aroun... [20:17:20] (03CR) 10jenkins-bot: Remove $wgStyleVersion appending in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393353 (https://phabricator.wikimedia.org/T181318) (owner: 10Reedy) [20:20:10] !log demon@tin Synchronized wmf-config/CommonSettings.php: cli sapi fixes (duration: 00m 45s) [20:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:14] !log demon@tin i lied, that was for wgStyleVersion removal [20:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:11] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3790202 (10Mholloway) I've reached out to Partnerships about getting in touch with Maroc and INWI for IP range updates. [20:53:49] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3790296 (10Keegan) >>! In T174342#3790202, @Mholloway wrote: > I've reached out to Partnerships about getting in touch with Maroc and INWI for IP range updates. Great, thank you! 
[21:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171127T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:00:18] ORES is deploying the all. [21:05:28] (03PS1) 10Awight: Temporarily disable ORES on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393654 [21:06:11] Here we go. [21:06:22] * halfak hovers hand over the big red "abort" button just in case [21:06:22] (03CR) 10Awight: [C: 032] Temporarily disable ORES on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393654 (owner: 10Awight) [21:07:41] (03Merged) 10jenkins-bot: Temporarily disable ORES on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393654 (owner: 10Awight) [21:07:51] (03CR) 10jenkins-bot: Temporarily disable ORES on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393654 (owner: 10Awight) [21:07:54] (03Draft1) 10Paladox: phabricator: Fix elasticsearch version field [puppet] - 10https://gerrit.wikimedia.org/r/393655 (https://phabricator.wikimedia.org/T181437) [21:07:59] (03PS2) 10Paladox: phabricator: Fix elasticsearch version field [puppet] - 10https://gerrit.wikimedia.org/r/393655 (https://phabricator.wikimedia.org/T181437) [21:08:14] !log awight@tin Finished deploy [ores/deploy@82a13ae]: Rollback ORES (take 3); 181006 (duration: 9942m 42s) [21:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:03] 10Operations, 10Cassandra, 10User-Eevans: Upload new cassandra-tools-wmf package to Debian repository - https://phabricator.wikimedia.org/T181438#3790356 (10Eevans) [21:10:56] !log Previous “rollback ORES” was from a stale screen session, no actions were taken today. 
[21:10:58] "duration: 9942m 42s" [21:10:58] wat [21:11:01] !log awight@tin Synchronized wmf-config/InitialiseSettings.php: Temporarily disable ORES on wikidata (duration: 00m 45s) [21:11:01] lol [21:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:04] oh [21:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:29] Also: When you get the wrong syntax for a scap command it should automatically revoke your rights until you can prove you read the man page :p [21:12:26] lol [21:13:45] PROBLEM - Druid historical on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server historical [21:13:55] PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:16:35] (03PS1) 10Kaldari: Enable MP3 uploads on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393661 (https://phabricator.wikimedia.org/T120288) [21:18:26] * awight has <3 <3 for whoever make gate-and-submit-swat so fast. [21:23:38] !log awight@tin Synchronized php-1.31.0-wmf.8/extensions/ORES: ORES error handling for bad thresholds, T181191 (duration: 00m 46s) [21:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:48] T181191: Make ORES-consuming pages more robust to ORES errors - https://phabricator.wikimedia.org/T181191 [21:26:24] no parsoid deploy today [21:27:03] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3790430 (10Dzahn) Sounds like it would be useful if Partnerships could be added in the Phabricator-based workflow for future updates. 
[21:27:46] RECOVERY - Druid historical on druid1003 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server historical [21:27:56] RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational [21:28:49] 10Operations, 10Traffic: Puppetize LVS interface IP sets per-DC for easy use in ferm rules - https://phabricator.wikimedia.org/T179027#3790444 (10Nuria) [21:33:10] !log awight@tin Started deploy [ores/deploy@e58bfbf]: Update ORES for impossible threshold handling and new wikidata editquality model, T179711 T180686 T180450 [21:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:18] T180686: Wikidata beta edit filters are showing every edit in watchlist as damaging - https://phabricator.wikimedia.org/T180686 [21:33:19] T179711: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711 [21:33:19] T180450: ORES thresholds for Wikidata is too strict - https://phabricator.wikimedia.org/T180450 [21:36:36] (03PS1) 10Awight: Reenable ORES on frwiki, ruwiki, and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393667 (https://phabricator.wikimedia.org/T181006) [21:38:29] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3790483 (10Gilles) I've looked for that particular request from the HAR file in the varnish logstash data you've linked to @ema and couldn't find it. 
[21:44:59] (03CR) 1020after4: [C: 031] phabricator: Fix elasticsearch version field [puppet] - 10https://gerrit.wikimedia.org/r/393655 (https://phabricator.wikimedia.org/T181437) (owner: 10Paladox) [21:46:04] !log awight@tin Finished deploy [ores/deploy@e58bfbf]: Update ORES for impossible threshold handling and new wikidata editquality model, T179711 T180686 T180450 (duration: 12m 55s) [21:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:13] T180686: Wikidata beta edit filters are showing every edit in watchlist as damaging - https://phabricator.wikimedia.org/T180686 [21:46:13] T179711: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711 [21:46:13] T180450: ORES thresholds for Wikidata is too strict - https://phabricator.wikimedia.org/T180450 [21:46:46] All continues to look good. [21:46:51] awight: ^ [21:47:48] K, continuing with codfw [21:47:57] !log awight@tin Started deploy [ores/deploy@e58bfbf]: (codfw) Update ORES for impossible threshold handling and new wikidata editquality model, T179711 T180686 T180450 [21:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:12] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3790555 (10Tgr) If varnish saw this request as taking long, that would be reflected in the `Backend-Timing` header, right? So this must be happening inside v... 
[21:52:15] (03PS1) 10Ayounsi: Bird: add monitoring to the VIP and bird process [puppet] - 10https://gerrit.wikimedia.org/r/393668 (https://phabricator.wikimedia.org/T98006) [21:52:41] (03CR) 10jerkins-bot: [V: 04-1] Bird: add monitoring to the VIP and bird process [puppet] - 10https://gerrit.wikimedia.org/r/393668 (https://phabricator.wikimedia.org/T98006) (owner: 10Ayounsi) [21:53:03] !log awight@tin Finished deploy [ores/deploy@e58bfbf]: (codfw) Update ORES for impossible threshold handling and new wikidata editquality model, T179711 T180686 T180450 (duration: 05m 06s) [21:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:11] T180686: Wikidata beta edit filters are showing every edit in watchlist as damaging - https://phabricator.wikimedia.org/T180686 [21:53:12] T179711: ORES 500 errors on a threshold lookup request - https://phabricator.wikimedia.org/T179711 [21:53:12] T180450: ORES thresholds for Wikidata is too strict - https://phabricator.wikimedia.org/T180450 [21:56:18] (03CR) 10Awight: [C: 032] Reenable ORES on frwiki, ruwiki, and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393667 (https://phabricator.wikimedia.org/T181006) (owner: 10Awight) [21:56:35] \o/ [21:57:07] (03PS2) 10Ayounsi: Bird: add monitoring to the VIP and bird process [puppet] - 10https://gerrit.wikimedia.org/r/393668 (https://phabricator.wikimedia.org/T98006) [21:57:44] (03Merged) 10jenkins-bot: Reenable ORES on frwiki, ruwiki, and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393667 (https://phabricator.wikimedia.org/T181006) (owner: 10Awight) [21:57:54] (03CR) 10jenkins-bot: Reenable ORES on frwiki, ruwiki, and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393667 (https://phabricator.wikimedia.org/T181006) (owner: 10Awight) [21:59:37] !log awight@tin Synchronized wmf-config/InitialiseSettings.php: Reenable ORES on frwiki, ruwiki, and wikidata; T181006 (duration: 00m 45s) [21:59:43] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:44] T181006: Watchlist and RecentChanges failure due to ORES on frwiki and ruwiki - https://phabricator.wikimedia.org/T181006 [21:59:50] !log clearing ORES threshold caches for ruwiki, frwiki, wikidatawiki. [21:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:05] dapatrick, bawolff, and Reedy: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171127T2200). Please do the needful. [22:00:05] No GERRIT patches in the queue for this window AFAICS. [22:00:28] alas, no security patches afaik [22:01:45] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:06:55] ^ me [22:11:24] (03PS1) 10Dzahn: remove Ganglia from labtest* hosts [puppet] - 10https://gerrit.wikimedia.org/r/393669 (https://phabricator.wikimedia.org/T177225) [22:12:19] waits before changing labtest* :) [22:15:28] mutante: I'm going to step away for a minute if you want to enable and run puppet that's cool [22:15:28] (03PS1) 10Herron: puppet: point codfw lvs servers at codfw puppet 4 masters [puppet] - 10https://gerrit.wikimedia.org/r/393670 (https://phabricator.wikimedia.org/T177254) [22:17:06] chasemp: ok :) thanks [22:19:45] (03CR) 10Dzahn: [C: 032] remove Ganglia from labtest* hosts [puppet] - 10https://gerrit.wikimedia.org/r/393669 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:19:58] (03PS2) 10Dzahn: remove Ganglia from labtest* hosts [puppet] - 10https://gerrit.wikimedia.org/r/393669 (https://phabricator.wikimedia.org/T177225) [22:21:32] !log otto@tin Started deploy [eventlogging/eventbus@e024af3]: deploy revert checked out at 3df06ab (we caught one). 
T180017 [22:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:39] T180017: Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017 [22:21:42] !log otto@tin Finished deploy [eventlogging/eventbus@e024af3]: deploy revert checked out at 3df06ab (we caught one). T180017 (duration: 00m 10s) [22:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:35] !log decom'ing Ganglia from all labtest* hosts - packages removed by puppet etc, might cause some false positives in Icinga but it's ok (T177225) [22:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:42] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225 [22:28:06] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/8998/" [puppet] - 10https://gerrit.wikimedia.org/r/393668 (https://phabricator.wikimedia.org/T98006) (owner: 10Ayounsi) [22:29:31] !log labtestnet2001 - re-enabled puppet to decom ganglia, no errors [22:29:36] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3790663 (10Samat) >>! In T181315#3788787, @ema wrote: > We've recently started logging requests taking longer than 60 seconds (from varnish's point of view)... 
[22:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:54] 10Operations, 10Ops-Access-Requests: Requesting access to the ldap nda group - https://phabricator.wikimedia.org/T181446#3790664 (10Paladox) [22:30:09] 10Operations, 10Ops-Access-Requests: Requesting access to the ldap nda group - https://phabricator.wikimedia.org/T181446#3790679 (10Paladox) [22:31:45] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:32:23] 10Operations, 10Ops-Access-Requests: Requesting access to the ldap nda group - https://phabricator.wikimedia.org/T181446#3790681 (10Dzahn) Hi @Paladox I know it since we already talked on IRC, but you should add that you already mailed legal and your question if the volunteer NDA is right and all that. I can... [22:33:04] 10Operations, 10Ops-Access-Requests, 10Gerrit: Requesting access to the ldap nda group - https://phabricator.wikimedia.org/T181446#3790698 (10Dzahn) [22:36:54] 10Operations, 10Performance-Team, 10Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3790707 (10Samat) @Tgr I think I experienced this on multiple connections, but I usually use my laptop to edit Wikipedia. The only other place where I edit s... 
[22:36:58] Operations, Ops-Access-Requests, Gerrit: Requesting access to the ldap nda group - https://phabricator.wikimedia.org/T181446#3790708 (Paladox)
[22:38:25] Operations, Ops-Access-Requests, Gerrit: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3790711 (Dzahn)
[22:41:46] (PS8) Smalyshev: Enable configuration for aliasing namespaces [puppet] - https://gerrit.wikimedia.org/r/392554 (https://phabricator.wikimedia.org/T181016)
[22:46:22] (PS1) Dzahn: remove Ganglia from cache::canary (cp1008) [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225)
[22:46:43] (CR) jerkins-bot: [V: -1] remove Ganglia from cache::canary (cp1008) [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[22:47:21] (PS2) Dzahn: remove Ganglia from cache::canary (cp1008) [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225)
[22:47:42] (CR) jerkins-bot: [V: -1] remove Ganglia from cache::canary (cp1008) [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[22:48:43] (PS3) Dzahn: remove Ganglia from cache::canary cp1008 [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225)
[22:49:02] (CR) jerkins-bot: [V: -1] remove Ganglia from cache::canary cp1008 [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[22:49:53] (PS4) Dzahn: remove Ganglia from cache canary cp1008 [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225)
[22:50:12] (CR) jerkins-bot: [V: -1] remove Ganglia from cache canary cp1008 [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[22:51:23] (PS5) Dzahn: remove Ganglia from cache::canary, cp1008 [puppet] - https://gerrit.wikimedia.org/r/393674 (https://phabricator.wikimedia.org/T177225)
[22:51:31] sigh :) invalid commit message
[22:51:37] but now
[22:54:28] Operations, Performance-Team, Traffic: load.php response taking 160s (of which only 0.031s in Apache) - https://phabricator.wikimedia.org/T181315#3790749 (Gilles) >>! In T181315#3790555, @Tgr wrote: > If varnish saw this request as taking long, that would be reflected in the `Backend-Timing` header,...
[22:55:01] Operations, Ops-Access-Requests, Gerrit: Access to logstash (LDAP group 'nda') for Paladox - https://phabricator.wikimedia.org/T181446#3790664 (faidon) LDAP NDA access effectively means getting access to private and sensitive information, on multiple servers and services, across the board. As such, i...
[22:58:12] (PS1) Dzahn: remove Ganglia from cache::misc [puppet] - https://gerrit.wikimedia.org/r/393675 (https://phabricator.wikimedia.org/T177225)
[22:58:14] (PS1) Dzahn: remove ganglia from cache::text,cache::upload [puppet] - https://gerrit.wikimedia.org/r/393676 (https://phabricator.wikimedia.org/T177225)
[23:01:16] (PS1) Andrew Bogott: profile::puppetmaster::common: Always enable environments [puppet] - https://gerrit.wikimedia.org/r/393677
[23:01:17] (PS1) Andrew Bogott: puppetmaster::standalone: include environment env [puppet] - https://gerrit.wikimedia.org/r/393678
[23:02:47] (PS1) Ladsgroup: Revert "Comply wikidata with new ores thresholds" [mediawiki-config] - https://gerrit.wikimedia.org/r/393679 (https://phabricator.wikimedia.org/T180450)
[23:11:12] (PS1) Madhuvishy: public_dumps: Set up NFS on the dumps servers [puppet] - https://gerrit.wikimedia.org/r/393680 (https://phabricator.wikimedia.org/T181431)
[23:14:42] (CR) Madhuvishy: [C: 2] public_dumps: Set up NFS on the dumps servers [puppet] - https://gerrit.wikimedia.org/r/393680 (https://phabricator.wikimedia.org/T181431) (owner: Madhuvishy)
[23:23:25] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:28:15] (PS1) Dzahn: labtest/labvirt/labs*: remove Ganglia [puppet] - https://gerrit.wikimedia.org/r/393682 (https://phabricator.wikimedia.org/T177225)
[23:29:04] ^i've got the labstore ping
[23:29:44] (PS2) Dzahn: labtest/labvirt/labs*: remove Ganglia [puppet] - https://gerrit.wikimedia.org/r/393682 (https://phabricator.wikimedia.org/T177225)
[23:30:32] (CR) Dzahn: [C: 2] "talked with cloud-admins :)" [puppet] - https://gerrit.wikimedia.org/r/393682 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[23:35:50] (PS1) Dzahn: ganeti: remove Ganglia [puppet] - https://gerrit.wikimedia.org/r/393683 (https://phabricator.wikimedia.org/T177225)
[23:42:45] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Puppet has 67 failures. Last run 2 minutes ago with 67 failures. Failed resources (up to 3 shown): User[edenhill],User[midom],User[jgirault],User[ssmith]