[00:03:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 69.57% of data above the critical threshold [5000000.0]
[00:13:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[00:49:49] (PS1) Yuvipanda: tools: Add authentication for docker registry [puppet] - https://gerrit.wikimedia.org/r/273840 (https://phabricator.wikimedia.org/T118758)
[00:51:10] (PS2) Yuvipanda: tools: Add authentication for docker registry [puppet] - https://gerrit.wikimedia.org/r/273840 (https://phabricator.wikimedia.org/T118758)
[00:52:45] (PS3) Yuvipanda: tools: Add authentication for docker registry [puppet] - https://gerrit.wikimedia.org/r/273840 (https://phabricator.wikimedia.org/T118758)
[00:58:06] (PS4) Yuvipanda: tools: Add authentication for docker registry [puppet] - https://gerrit.wikimedia.org/r/273840 (https://phabricator.wikimedia.org/T118758)
[02:13:34] Operations, Phabricator, Release-Engineering-Team: just in case: set up a new oauth consumer on mediawiki.org that has oauth callback url checkbox enabled - https://phabricator.wikimedia.org/T96618#2071231 (mmodell)
[02:25:00] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 11m 12s)
[02:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:43] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Feb 29 02:32:43 UTC 2016 (duration 7m 43s)
[02:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:17:09] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[03:24:10] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:30:49] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: puppet fail
[06:31:00] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: puppet fail
[06:31:29] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:50] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:59] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:58] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:00] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:56:19] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:56:29] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:29] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:57:18] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:57:20] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:57:29] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:57:58] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:20] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:13:22] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2071421 (Legoktm) ``` 2016-02-29 02:49:07 mw110...
[07:35:43] Operations, ops-eqiad, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2071461 (jcrespo) @liangent, user databases were lost and will not be able to be recovered.
[08:09:22] (PS2) Nemo bis: [Planet Wikimedia] Multiple additions to English, Spanish, Ukrainian [puppet] - https://gerrit.wikimedia.org/r/273777
[08:25:10] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2071506 (Legoktm) a:Legoktm>None Unassign...
[08:27:57] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2071510 (Elitre) FWIW I'm hearing at it.wp this...
[08:58:37] (CR) Muehlenhoff: "python-monotonic isn't installed anywhere in production currently (and it's also not available in standard repos for jessie, trusty and pr" [puppet] - https://gerrit.wikimedia.org/r/273512 (owner: Andrew Bogott)
[09:00:45] Operations, ops-eqiad, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2071553 (jcrespo) I would happily would, but I would prefer to actually solve the issue for you forever so you can self-serve. One tip before I research what is failing: * It is probable that commons a...
[09:08:45] (PS7) Thiemo Mättig (WMDE): Avoid breaking full phabricator URLs [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997)
[09:11:29] (CR) Muehlenhoff: [C: 2 V: 2] Drop the annotations, dpkg from jessie chokes on them and they are only needed for bootstrapping new archs [debs/linux44] - https://gerrit.wikimedia.org/r/273471 (owner: Muehlenhoff)
[09:11:52] (CR) Muehlenhoff: [C: 2 V: 2] Use gcc 4.9 on x86 [debs/linux44] - https://gerrit.wikimedia.org/r/273472 (owner: Muehlenhoff)
[09:13:34] Operations, MediaWiki-Uploading, Multimedia, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2071563 (zhuyifei1999)
[09:13:39] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:14:37] Operations, Commons, MediaWiki-Uploading, Multimedia, and 2 others: Raise max upload limit above 1GB - https://phabricator.wikimedia.org/T76614#807206 (zhuyifei1999) @Dzahn First test in video2commons tool, failed: {T128358}
[09:15:25] (CR) Thiemo Mättig (WMDE): "Rebased." [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE))
[09:15:28] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 28169 bytes in 8.681 second response time
[09:20:57] (CR) Thiemo Mättig (WMDE): "Have you see patch I2f06f93 I uploaded about four months ago? It does fix this issue without breaking existing functionality." [puppet] - https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: Ricordisamoa)
[09:28:33] Operations, ops-eqiad, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2071592 (jcrespo) @liangent Your problem is that your script is trying to execute `CREATE TRIGGER` with `DEFINER=```root```@```208.80.154.151````, for which you have no rights. Removing the DEFINER, wit...
[09:29:12] Operations, ops-eqiad, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2071593 (jcrespo) Also, please use a different ticket for importing issues, as this is offtopic.
[09:33:12] (CR) Muehlenhoff: [C: -1] "Some of these packages already use the fonts-* name in trusty:" [puppet] - https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: Dzahn)
[09:45:23] (PS2) MarcoAurelio: Set transwiki import sources for hi.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/272992 (https://phabricator.wikimedia.org/T127593)
[09:54:38] (PS1) Filippo Giunchedi: cassandra: add restbase1009-b instance [puppet] - https://gerrit.wikimedia.org/r/273858
[09:55:02] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase1009-b instance [puppet] - https://gerrit.wikimedia.org/r/273858 (owner: Filippo Giunchedi)
[09:58:39] !log bootstrap restbase1009-b T95253
[09:58:40] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253
[09:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:02:37] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped to be funny years ago. - https://phabricator.wikimedia.org/T119042#2071612 (Joe)
[10:07:43] PROBLEM - cassandra-b CQL 10.64.48.130:9042 on restbase1009 is CRITICAL: Connection refused
[10:07:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 64.00% of data above the critical threshold [5000000.0]
[10:08:02] PROBLEM - cassandra-b service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[10:11:41] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped to be funny years ago. - https://phabricator.wikimedia.org/T119042#2071619 (Joe) For the record, I isolated the issue. It has nothing to do with the role function or anything outside the puppet parse...
[10:12:17] <_joe_> godog: ^^ if you want a sour laugh...
[10:13:04] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.130:9042 on restbase1009 is CRITICAL: Connection refused Filippo Giunchedi bootstrap
[10:13:04] ACKNOWLEDGEMENT - cassandra-b service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed Filippo Giunchedi bootstrap
[10:14:03] _joe_: I can't even
[10:14:35] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped to be funny years ago. - https://phabricator.wikimedia.org/T119042#2071621 (Joe)
[10:15:24] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:16:02] mw1140 down?
[10:16:13] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:17:02] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:17:29] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped to be funny years ago. - https://phabricator.wikimedia.org/T119042#2071623 (Joe) So, it looks like we're left with the option of either: # migrate all second-level roles (like role::mediawiki::appser...
[10:17:32] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:17:32] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:17:32] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:17:43] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:17:44] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:17:44] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:17:52] PROBLEM - puppet last run on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:17:53] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:18:04] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:18:22] Operations, Need-volunteer, Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2071638 (Joe)
[10:18:22] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:18:24] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped to be funny years ago. - https://phabricator.wikimedia.org/T119042#2071637 (Joe) Open>stalled
[10:18:32] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:18:43] RECOVERY - cassandra-b service on restbase1009 is OK: OK - cassandra-b is active
[10:19:42] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#1816403 (Joe)
[10:19:43] system is up, but extremely slow
[10:19:56] <_joe_> jynus: oom
[10:20:03] <_joe_> see ganglia
[10:20:23] soft restart? Is it regular or queue processor?
[10:21:26] it is apt, not previos know issue
[10:21:28] (PS3) ArielGlenn: report_minions: show minions known about in redis for repo [software/deployment/trebuchet-trigger] - https://gerrit.wikimedia.org/r/219845
[10:21:29] *api
[10:22:08] <_joe_> jynus: no it happens from time to time to api appservers as well
[10:22:14] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[10:22:16] mm
[10:22:39] if you want to debug it, say so, otherwise I will try to cleanly restart it
[10:23:09] ssh and serial are too slow to try to do something
[10:23:28] <_joe_> no just powerclycle it
[10:23:35] <_joe_> if you're already in the drac
[10:23:37] doing
[10:23:40] yes
[10:25:33] !log powercycling mw1140, almost 100% unresponsive, OOM probable cause
[10:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:26:00] it responded well
[10:27:22] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient
[10:27:23] nothing strange on the boot
[10:27:42] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212
[10:27:54] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 500 bytes in 4.922 second response time
[10:28:12] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[10:28:12] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[10:28:12] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up
[10:28:22] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 5 % full
[10:28:23] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm
[10:28:23] RECOVERY - Disk space on mw1140 is OK: DISK OK
[10:28:32] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 44 minutes ago with 0 failures
[10:28:33] RECOVERY - DPKG on mw1140 is OK: All packages OK
[10:28:34] (CR) ArielGlenn: [C: 2 V: 2] report_minions: show minions known about in redis for repo [software/deployment/trebuchet-trigger] - https://gerrit.wikimedia.org/r/219845 (owner: ArielGlenn)
[10:28:43] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 71249 bytes in 1.203 second response time
[10:28:48] (PS2) MarcoAurelio: Enable signature button at NS:102 for frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688)
[10:28:52] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:29:03] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed
[10:30:21] https://wikitech.wikimedia.org/wiki/User:KamaljitchakrabortyM4074291 is doing weird things there...
[10:30:43] I'll clean up
[10:31:41] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#2071663 (Joe) Another possible solution: we move the master to be puppet 3.7 or later (so upgrading to jessie would do) and we use th...
[10:32:52] thanks Reedy - consider a block too :)
[10:33:07] Yeah... I was trying to decide if that was warranted
[10:38:00] (CR) Luke081515: "Question solved." [mediawiki-config] - https://gerrit.wikimedia.org/r/273828 (https://phabricator.wikimedia.org/T128205) (owner: Luke081515)
[10:38:54] Operations, ops-eqiad, Traffic, Patch-For-Review: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#2071672 (ema) p:Triage>Normal a:Christopher
[10:39:01] (PS2) ArielGlenn: git deploy: update for salt bug fix, pylint [puppet] - https://gerrit.wikimedia.org/r/243411
[10:39:43] Operations, ops-eqiad, Traffic, Patch-For-Review: eqiad cache cluster re-arrangements - https://phabricator.wikimedia.org/T125486#1988833 (ema) a:Christopher>Cmjohnson
[10:42:48] Operations, ops-eqiad: dbstore1001 management interface has saturated the number of available ssh connections - https://phabricator.wikimedia.org/T126227#2071689 (ema) p:Triage>Normal a:Cmjohnson
[10:43:05] (PS3) ArielGlenn: git deploy: update for salt bug fix, pylint [puppet] - https://gerrit.wikimedia.org/r/243411
[10:43:33] Operations, ops-eqiad, Patch-For-Review: decom iodine - https://phabricator.wikimedia.org/T126483#2071692 (ema) p:Triage>Normal a:Cmjohnson
[10:44:16] Operations, ops-eqiad, Patch-For-Review: Decommission mw1037 - https://phabricator.wikimedia.org/T126350#2071694 (ema) p:Triage>Normal a:Cmjohnson
[10:44:39] (CR) ArielGlenn: [C: 2] git deploy: update for salt bug fix, pylint [puppet] - https://gerrit.wikimedia.org/r/243411 (owner: ArielGlenn)
[10:46:15] !log hhvm restarted on mw1119
[10:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:47:14] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 499 bytes in 0.128 second response time
[10:47:23] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 71238 bytes in 0.419 second response time
[10:52:49] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#2071737 (scfc) @Joe: One thing I found intriguing with my tests was that when I defined a class `toollabs::xyz` in `manifests/role/rc...
[10:55:43] (PS18) Gehel: Ship Elasticsearch logs to logstash [puppet] - https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101)
[11:01:13] (CR) Kelson: [C: 1] Fix regex to enable upload from ETHZ Library with the GWT [mediawiki-config] - https://gerrit.wikimedia.org/r/273774 (owner: Kelson)
[11:02:44] (CR) Kelson: [C: 1] Correct one Domain at $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/273776 (https://phabricator.wikimedia.org/T123109) (owner: Luke081515)
[11:04:20] (CR) Luke081515: [C: 1] Temporarily disable thank-you-edit notifications [mediawiki-config] - https://gerrit.wikimedia.org/r/273836 (https://phabricator.wikimedia.org/T128249) (owner: Catrope)
[11:12:52] Operations, ops-eqiad: dbstore1001 management interface has saturated the number of available ssh connections - https://phabricator.wikimedia.org/T126227#2071749 (jcrespo) Let's schedule it for a Thursday, as it should have finished the backups by then.
[11:20:51] Operations, Ops-Access-Requests, Patch-For-Review: Access Request for mobrovac as ci-admin to mess with CI infrastructure - https://phabricator.wikimedia.org/T128175#2066176 (hashar) >>! In T128175#2066977, @JanZerebecki wrote: > > (Offtopic: We could copy the keys from ldap into place at instance cr...
[11:23:17] (CR) Giuseppe Lavagetto: [C: 1] Ship Elasticsearch logs to logstash [puppet] - https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: Gehel)
[11:27:31] (PS1) Muehlenhoff: Update to 4.4.3 [debs/linux44] - https://gerrit.wikimedia.org/r/273870
[11:31:55] (PS6) Giuseppe Lavagetto: ntp: further reorg, split of client and server code [puppet] - https://gerrit.wikimedia.org/r/273247
[11:37:47] (PS1) Jcrespo: Settingup dbproxy1005 for m5 load balancing/failover [puppet] - https://gerrit.wikimedia.org/r/273871 (https://phabricator.wikimedia.org/T126251)
[11:38:16] (Abandoned) Giuseppe Lavagetto: role::testsystem: move to module [puppet] - https://gerrit.wikimedia.org/r/273445 (owner: Giuseppe Lavagetto)
[11:38:42] (PS6) Giuseppe Lavagetto: role::diamond: move to standard::diamond [puppet] - https://gerrit.wikimedia.org/r/273248
[11:40:10] (PS2) Giuseppe Lavagetto: apache-fast-test: fix pybal url, add codfw and options [puppet] - https://gerrit.wikimedia.org/r/273200
[11:40:24] (PS19) Gehel: Ship Elasticsearch logs to logstash [puppet] - https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101)
[11:42:49] (PS1) Jcrespo: Reimage dbproxy1005-11 with jessie [puppet] - https://gerrit.wikimedia.org/r/273872 (https://phabricator.wikimedia.org/T126251)
[11:43:57] (PS2) Jcrespo: Set up dbproxy1005 for m5 load balancing/failover [puppet] - https://gerrit.wikimedia.org/r/273871 (https://phabricator.wikimedia.org/T126251)
[11:44:37] (CR) Jcrespo: [C: 2] Set up dbproxy1005 for m5 load balancing/failover [puppet] - https://gerrit.wikimedia.org/r/273871 (https://phabricator.wikimedia.org/T126251) (owner: Jcrespo)
[11:44:52] (CR) Jcrespo: [V: 2] Set up dbproxy1005 for m5 load balancing/failover [puppet] - https://gerrit.wikimedia.org/r/273871 (https://phabricator.wikimedia.org/T126251) (owner: Jcrespo)
[11:45:09] (PS2) Jcrespo: Reimage dbproxy1005-11 with jessie [puppet] - https://gerrit.wikimedia.org/r/273872 (https://phabricator.wikimedia.org/T126251)
[11:46:01] (CR) Jcrespo: [C: 2 V: 2] Reimage dbproxy1005-11 with jessie [puppet] - https://gerrit.wikimedia.org/r/273872 (https://phabricator.wikimedia.org/T126251) (owner: Jcrespo)
[11:47:29] jynus about?
[11:48:05] Steinsplitter, what do you mean?
[11:48:25] jynus: NOT IN operations are broken on labs?
[11:48:44] sql
[11:49:14] I do not know what you are talking about
[11:49:50] using NOT IN (query) in sql queryes no longer works for me on labs
[11:49:56] if you have any issue or belive something is broken, file a ticket with appropiate backgound and I will gladly help
[11:50:26] I have not touched any config/permission/etc in labs from a long time ago
[11:56:00] (PS2) Filippo Giunchedi: codfw: add statsd service entry [dns] - https://gerrit.wikimedia.org/r/273199 (https://phabricator.wikimedia.org/T127976)
[11:56:09] (CR) Filippo Giunchedi: [C: 2 V: 2] codfw: add statsd service entry [dns] - https://gerrit.wikimedia.org/r/273199 (https://phabricator.wikimedia.org/T127976) (owner: Filippo Giunchedi)
[11:56:29] s/from/since/
[11:57:23] (PS2) Muehlenhoff: Update to 4.4.3 [debs/linux44] - https://gerrit.wikimedia.org/r/273870
[11:57:29] !log reimaging dbproxy1005 with jessie before its put into production (potential alerts due to new role)
[11:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:58:29] (PS2) Filippo Giunchedi: swift: return 400 on UnicodeDecodeErrors [puppet] - https://gerrit.wikimedia.org/r/273431 (https://phabricator.wikimedia.org/T128081)
[11:59:18] (CR) Muehlenhoff: [C: 2 V: 2] Update to 4.4.3 [debs/linux44] - https://gerrit.wikimedia.org/r/273870 (owner: Muehlenhoff)
[12:02:38] Operations: upgrade 15+4 swift servers from precise to trusty - https://phabricator.wikimedia.org/T125024#2071822 (fgiunchedi) I've upgraded `ms-be1008` to `ms-be1012`, will continue tue/wed to complete all ms-be. ms-fe is pending https://gerrit.wikimedia.org/r/#/c/273431/
[12:07:50] (PS1) ArielGlenn: 2014.7.5 patch to catch failure to retrieve/decrypt master response [debs/salt] (jessie) - https://gerrit.wikimedia.org/r/273875
[12:07:52] (PS1) ArielGlenn: 2014.7.5 patch to use user specified timeout for runner publish [debs/salt] (jessie) - https://gerrit.wikimedia.org/r/273876
[12:09:58] (CR) Gehel: [V: 2] Ship Elasticsearch logs to logstash [puppet] - https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: Gehel)
[12:10:43] (PS1) ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1+wm3 [debs/salt] (jessie) - https://gerrit.wikimedia.org/r/273879
[12:12:29] (PS20) Gehel: Ship Elasticsearch logs to logstash [puppet] - https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101)
[12:12:44] (CR) Gehel: [C: 2] Ship Elasticsearch logs to logstash [puppet] - https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101) (owner: Gehel)
[12:14:21] Operations, Salt, Trebuchet, Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#2071837 (ArielGlenn) The next round osf salt packages will include the changes to the runner publish function. There is stil an issue in trebuchet-trigger with...
[12:16:43] !log elastic2001.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101)
[12:16:45] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101
[12:16:45] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697
[12:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:17:02] Operations, Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1961461 (ArielGlenn)
[12:17:22] Operations, Salt, Trebuchet, Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#651916 (ArielGlenn)
[12:18:53] Operations, Salt, Trebuchet: salt-minion processes terminate on deployment sync - https://phabricator.wikimedia.org/T122544#2071871 (ArielGlenn) This is the same underlying issue as T124646 and as such I will merge them.
[12:19:39] Operations, Salt, Trebuchet: salt-minion processes terminate on deployment sync - https://phabricator.wikimedia.org/T122544#2071874 (ArielGlenn)
[12:19:41] Operations, Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#1961461 (ArielGlenn)
[12:22:12] Operations, Salt: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#2071878 (ArielGlenn) Open>Resolved This is done as it's going to be; prod is happy, toollabs is happy, etc. Closing.
[12:23:22] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:32:52] (PS2) Filippo Giunchedi: restbase: move test/staging to its own cluster [puppet] - https://gerrit.wikimedia.org/r/272989 (https://phabricator.wikimedia.org/T103124)
[12:32:59] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: move test/staging to its own cluster [puppet] - https://gerrit.wikimedia.org/r/272989 (https://phabricator.wikimedia.org/T103124) (owner: Filippo Giunchedi)
[12:35:38] FYI that might wiggle ganglia ^ since adding a cluster bounces the aggregators and gmetad
[12:37:47] <_joe_> yes
[12:39:05] actually gmetad isn't restarting, taking a look
[12:39:27] ah I think I know what it is, a / in the cluster name is my wild guess ATM
[12:41:00] (PS1) Muehlenhoff: Also strip annotations from debian/config/defines [debs/linux44] - https://gerrit.wikimedia.org/r/273880
[12:42:46] (PS1) Filippo Giunchedi: ganglia: fix restbase_test cluster name [puppet] - https://gerrit.wikimedia.org/r/273881
[12:43:12] (CR) Filippo Giunchedi: [C: 2 V: 2] ganglia: fix restbase_test cluster name [puppet] - https://gerrit.wikimedia.org/r/273881 (owner: Filippo Giunchedi)
[12:47:23] <_joe_> godog: ganglia is back but... no data?
[12:47:47] <_joe_> uhm flaky :/
[12:49:43] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[12:50:06] _joe_: yeah gmetad has restarted now, it will catch up
[12:50:09] (PS1) Gehel: Adding authorization for user Gehel (Guillaume Lederrey) [puppet] - https://gerrit.wikimedia.org/r/273884
[12:50:55] !log gmetad on uranium restarted following https://gerrit.wikimedia.org/r/273881 it will converge asap
[12:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:52:54] Operations, Dumps-Generation: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2071924 (ArielGlenn) Plan for ms1001 (proceeding now, no downtime notice to us or to users needed: Disable all rsyncs to/from ms1001, any cron jobs that run there Make sure ms1001 does not e...
[12:53:34] Operations: Reinstall redis servers (Job queues) with Jessie - https://phabricator.wikimedia.org/T123675#2071925 (elukey) p:Triage>Normal
[12:53:59] (PS3) Giuseppe Lavagetto: apache-fast-test: fix pybal url, add codfw and options [puppet] - https://gerrit.wikimedia.org/r/273200
[12:56:19] (PS4) Giuseppe Lavagetto: apache-fast-test: fix pybal url, add codfw and options [puppet] - https://gerrit.wikimedia.org/r/273200
[12:57:12] (PS5) Giuseppe Lavagetto: apache-fast-test: fix pybal url, add codfw and options [puppet] - https://gerrit.wikimedia.org/r/273200
[12:57:20] (CR) Giuseppe Lavagetto: [C: 2 V: 2] apache-fast-test: fix pybal url, add codfw and options [puppet] - https://gerrit.wikimedia.org/r/273200 (owner: Giuseppe Lavagetto)
[12:57:41] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/273884 (owner: Gehel)
[13:02:50] (PS1) ArielGlenn: dumps: remove last reference to outdated rsync bash script [puppet] - https://gerrit.wikimedia.org/r/273885
[13:03:45] (PS2) ArielGlenn: dumps: remove last reference to outdated rsync bash script [puppet] - https://gerrit.wikimedia.org/r/273885
[13:04:13] (PS3) Giuseppe Lavagetto: role::mail::sender: move to standard [puppet] - https://gerrit.wikimedia.org/r/273444
[13:05:09] (CR) ArielGlenn: [C: 2] dumps: remove last reference to outdated rsync bash script [puppet] - https://gerrit.wikimedia.org/r/273885 (owner: ArielGlenn)
[13:07:28] (CR) Muehlenhoff: [C: 2 V: 2] Also strip annotations from debian/config/defines [debs/linux44] - https://gerrit.wikimedia.org/r/273880 (owner: Muehlenhoff)
[13:11:31] Operations, Monitoring: ganglia cluster name validation in puppet - https://phabricator.wikimedia.org/T128369#2071957 (fgiunchedi) p:Triage>Normal a:fgiunchedi
[13:12:32] (PS1) Muehlenhoff: Relax the build dependency on kernel-wedge [debs/linux44] - https://gerrit.wikimedia.org/r/273887
[13:12:51] (PS1) Filippo Giunchedi: ganglia: replace reserved characters in cluster name [puppet] - https://gerrit.wikimedia.org/r/273888 (https://phabricator.wikimedia.org/T128369)
[13:20:23] (CR) Muehlenhoff: [C: 2 V: 2] Relax the build dependency on kernel-wedge [debs/linux44] - https://gerrit.wikimedia.org/r/273887 (owner: Muehlenhoff)
[13:20:24] Hey godog, we are experiencing issues with the hadoop cluster, and I'm looking at ganglia --> it seems something went wrong in the last hour ar so ... Would it be ganglia ?
[13:21:13] joal: yeah gmetad got restarted but should be fully recovered by now I think?
[13:21:24] It is now, but there is a hole :)
[13:21:34] No issue, it was to confirm it's not hadoop related :)
[13:21:41] Since hadoop is no good either: )
[13:22:00] heheh yeah not related to hadoop joal, related to ganglia :(
[13:22:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[13:22:24] k, thx godog - Continuing to investigate hadoop issue [13:23:02] joal: kk, let me know if you see ganglia acting up tho! [13:23:08] (PS1) ArielGlenn: dumps rsync jobs: leave files for cron in place when job is disabled [puppet] - https://gerrit.wikimedia.org/r/273890 [13:23:15] sure godog [13:24:30] (CR) ArielGlenn: [C: 2] dumps rsync jobs: leave files for cron in place when job is disabled [puppet] - https://gerrit.wikimedia.org/r/273890 (owner: ArielGlenn) [13:26:02] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:27:57] (CR) Ema: [C: 1] traffic-pool: remove BindsTo= [puppet] - https://gerrit.wikimedia.org/r/273502 (owner: BBlack) [13:28:35] (PS1) Muehlenhoff: Disable rt kernel flavour [debs/linux44] - https://gerrit.wikimedia.org/r/273891 [13:30:17] (PS1) ArielGlenn: disable rsync between ms1001 and dataset1001, prep for jessie upgrade [puppet] - https://gerrit.wikimedia.org/r/273892 (https://phabricator.wikimedia.org/T123724) [13:31:24] (CR) Muehlenhoff: [C: 2 V: 2] Disable rt kernel flavour [debs/linux44] - https://gerrit.wikimedia.org/r/273891 (owner: Muehlenhoff) [13:31:37] (CR) ArielGlenn: [C: 2] disable rsync between ms1001 and dataset1001, prep for jessie upgrade [puppet] - https://gerrit.wikimedia.org/r/273892 (https://phabricator.wikimedia.org/T123724) (owner: ArielGlenn) [13:33:15] (CR) Ema: [C: 1] ganglia: replace reserved characters in cluster name [puppet] - https://gerrit.wikimedia.org/r/273888 (https://phabricator.wikimedia.org/T128369) (owner: Filippo Giunchedi) [13:35:46] !log disabling puppet on cp* for traffic-pool work [13:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:02] (PS2) BBlack: traffic-pool: remove BindsTo= [puppet] - https://gerrit.wikimedia.org/r/273502 [13:36:15] (CR) BBlack: [C: 2 V: 2] traffic-pool: remove BindsTo= [puppet] - 
https://gerrit.wikimedia.org/r/273502 (owner: BBlack) [13:43:40] (CR) Ricordisamoa: "I have seen it, I hope it will fix the bug." [puppet] - https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: Ricordisamoa) [13:44:10] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2072057 (elukey) Summary of my findings so far: Varnish logs various kind of data (statistics, requests handled, etc..) in a shared memory file rather than in a file to a... [13:52:07] (PS2) Gehel: Adding authorization for user Gehel (Guillaume Lederrey) [puppet] - https://gerrit.wikimedia.org/r/273884 [13:52:29] (CR) Gehel: [C: 2] Adding authorization for user Gehel (Guillaume Lederrey) [puppet] - https://gerrit.wikimedia.org/r/273884 (owner: Gehel) [13:54:51] !log puppet back to normal on caches [13:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:17] I am running now puppet on dbproxy1005 [13:57:58] (PS1) Muehlenhoff: Update changelog [debs/linux44] - https://gerrit.wikimedia.org/r/273895 [14:00:02] !log icinga: adding authorization for Gehel [14:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:19] (CR) QChris: [C: -1] Avoid breaking full phabricator URLs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE)) [14:07:39] (CR) QChris: [C: -1] Avoid breaking full phabricator URLs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/242237 (owner: Daniel Kinzler) [14:09:23] Operations, Analytics, Traffic: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#2072088 (BBlack) [14:10:16] (PS1) Muehlenhoff: Annotate CVE IDs fixed in initial 4.4.1 and 4.4.2 uploads [debs/linux44] - 
https://gerrit.wikimedia.org/r/273898 [14:10:24] Operations, Analytics, Traffic: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#2072101 (BBlack) [14:10:31] (PS2) Ema: Simplify VCL errorpage [puppet] - https://gerrit.wikimedia.org/r/273480 [14:11:08] !log re-enabling icinga notifications for elasticsearch cluster on codfw [14:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:31] (CR) Muehlenhoff: [C: 2 V: 2] Update changelog [debs/linux44] - https://gerrit.wikimedia.org/r/273895 (owner: Muehlenhoff) [14:14:02] (CR) Muehlenhoff: [C: 2 V: 2] Annotate CVE IDs fixed in initial 4.4.1 and 4.4.2 uploads [debs/linux44] - https://gerrit.wikimedia.org/r/273898 (owner: Muehlenhoff) [14:14:55] (CR) Ema: [C: 2 V: 2] Simplify VCL errorpage [puppet] - https://gerrit.wikimedia.org/r/273480 (owner: Ema) [14:16:40] !log elastic2001.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [14:16:42] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. 
- https://phabricator.wikimedia.org/T109101 [14:16:42] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [14:16:54] (PS1) Muehlenhoff: Mention changelog update [debs/linux44] - https://gerrit.wikimedia.org/r/273899 [14:17:15] (CR) Muehlenhoff: [C: 2 V: 2] Mention changelog update [debs/linux44] - https://gerrit.wikimedia.org/r/273899 (owner: Muehlenhoff) [14:17:36] !log upgrading nginx on cp* to 1.9.4-1+wmf2 for T126616 [14:17:37] T126616: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616 [14:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:31] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [14:39:16] Operations, Continuous-Integration-Scaling, Nodepool, WorkType-NewFunctionality: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#2072133 (hashar) > So at first the /debian/ sources are not in the subversion repository that is used by the Debia... [14:42:28] Operations, Traffic: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2072135 (BBlack) As of when the caches switch to Linux 4.4.2 kernels (coming soon), they'll have the updates for the real (not experimental) TCP Fast Open IANA option code from [[ https://tools.ietf.org... 
[14:43:20] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:44:46] (PS1) Muehlenhoff: Decom berkelium/curium [puppet] - https://gerrit.wikimedia.org/r/273906 (https://phabricator.wikimedia.org/T125962) [14:45:20] (CR) Thiemo Mättig (WMDE): Avoid breaking full phabricator URLs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE)) [14:49:03] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2072177 (Ottomata) OH awesome! Does the Varnish Utils lib have a similar ABI enough to just work with existing varishkafka code? [14:51:12] (PS1) ArielGlenn: re-enable rsyncs to ms1001 for finsl dataset1001 sync of all data [puppet] - https://gerrit.wikimedia.org/r/273907 [14:51:41] (PS2) ArielGlenn: re-enable rsyncs to ms1001 for finsl dataset1001 sync of all data [puppet] - https://gerrit.wikimedia.org/r/273907 [14:52:58] (CR) ArielGlenn: [C: 2] re-enable rsyncs to ms1001 for finsl dataset1001 sync of all data [puppet] - https://gerrit.wikimedia.org/r/273907 (owner: ArielGlenn) [14:53:01] Operations, Traffic, netops: Consider per-route DCTCP for dc-local traffic on jessie hosts - https://phabricator.wikimedia.org/T128377#2072179 (BBlack) [14:59:22] Blocked-on-Operations, RESTBase, Patch-For-Review: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124#2072196 (fgiunchedi) ok I've deployed https://gerrit.wikimedia.org/r/#/c/272989 this morning so hosts are now separated, I've moved the p... 
[15:03:46] Operations, Traffic, Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2072215 (BBlack) Open>Resolved @MoritzMuehlenhoff 's nginx package update deployed everywhere, and the openssl-related logspam is gone! [15:04:50] Operations: Decom/reclaim berkelium/curium - https://phabricator.wikimedia.org/T125962#2072222 (MoritzMuehlenhoff) a:MoritzMuehlenhoff [15:05:23] (CR) Filippo Giunchedi: [C: 1] restbase: override statsd metric prefix for restbase test cluster [puppet] - https://gerrit.wikimedia.org/r/273052 (https://phabricator.wikimedia.org/T103124) (owner: Eevans) [15:05:30] (CR) Filippo Giunchedi: [C: 1] restbase: override logging name [puppet] - https://gerrit.wikimedia.org/r/273061 (https://phabricator.wikimedia.org/T103124) (owner: Eevans) [15:12:31] (CR) Andrew Bogott: "Yep, removing that one package seems to do it. Easy enough." [puppet] - https://gerrit.wikimedia.org/r/273512 (owner: Andrew Bogott) [15:12:40] (Abandoned) Andrew Bogott: Pin the cloud archive at the same priority as wikimedia repo [puppet] - https://gerrit.wikimedia.org/r/273512 (owner: Andrew Bogott) [15:13:43] (PS2) Eevans: restbase: override statsd metric prefix for restbase test cluster [puppet] - https://gerrit.wikimedia.org/r/273052 (https://phabricator.wikimedia.org/T103124) [15:13:53] (PS2) Eevans: restbase: override logging name [puppet] - https://gerrit.wikimedia.org/r/273061 (https://phabricator.wikimedia.org/T103124) [15:18:44] (PS7) Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [15:19:16] (CR) jenkins-bot: [V: -1] Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) (owner: Giuseppe Lavagetto) 
[15:19:20] (CR) Mobrovac: [C: 1] "No-op in prod, name-change in staging: https://puppet-compiler.wmflabs.org/1887/" [puppet] - https://gerrit.wikimedia.org/r/273052 (https://phabricator.wikimedia.org/T103124) (owner: Eevans) [15:20:18] (CR) Mobrovac: [C: 1] "GTG - https://puppet-compiler.wmflabs.org/1888/" [puppet] - https://gerrit.wikimedia.org/r/273061 (https://phabricator.wikimedia.org/T103124) (owner: Eevans) [15:20:22] (PS1) Filippo Giunchedi: cassandra: add restbase1010-restbase1015 to site/install_server [puppet] - https://gerrit.wikimedia.org/r/273912 (https://phabricator.wikimedia.org/T128107) [15:21:31] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: override logging name [puppet] - https://gerrit.wikimedia.org/r/273061 (https://phabricator.wikimedia.org/T103124) (owner: Eevans) [15:21:40] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: override statsd metric prefix for restbase test cluster [puppet] - https://gerrit.wikimedia.org/r/273052 (https://phabricator.wikimedia.org/T103124) (owner: Eevans) [15:21:49] (PS3) Filippo Giunchedi: restbase: override statsd metric prefix for restbase test cluster [puppet] - https://gerrit.wikimedia.org/r/273052 (https://phabricator.wikimedia.org/T103124) (owner: Eevans) [15:21:56] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: override statsd metric prefix for restbase test cluster [puppet] - https://gerrit.wikimedia.org/r/273052 (https://phabricator.wikimedia.org/T103124) (owner: Eevans) [15:27:15] (PS2) Filippo Giunchedi: cassandra: add restbase1010-restbase1015 to site/install_server [puppet] - https://gerrit.wikimedia.org/r/273912 (https://phabricator.wikimedia.org/T128107) [15:27:21] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase1010-restbase1015 to site/install_server [puppet] - https://gerrit.wikimedia.org/r/273912 (https://phabricator.wikimedia.org/T128107) (owner: Filippo Giunchedi) [15:27:34] (PS8) Giuseppe 
Lavagetto: Add references to wmfServices for Cirrusearch. [mediawiki-config] - https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) [15:27:50] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2072286 (elukey) @Ottomata not sure yet, but it would be nice to have a single source code branch. The next step is for me and @ema to figure out if and how we could use... [15:28:09] (PS1) Muehlenhoff: Move access_new_install role to neodymium [puppet] - https://gerrit.wikimedia.org/r/273913 [15:30:34] !log forcing puppet run in restbase staging ((noop) config deploy) : T103124 [15:30:35] T103124: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124 [15:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:32] !log forcing puppet run in restbase clustter ((noop) config deploy) : T103124 [15:32:33] T103124: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124 [15:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:59] !log Perform rolling restart of restbase in staging cluster : T103124 [15:34:00] T103124: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124 [15:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:45] Operations, Ops-Access-Requests: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2072331 (schana) [15:42:07] !log Rolling restart of restbase staging complete : T103124 [15:42:08] T103124: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124 [15:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, 
Master [15:43:11] !log Perform rolling restart of restbase in production cluster : T103124 [15:43:12] T103124: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124 [15:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:40] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2072350 (ema) Although it would be great for libvarnishtools to be a proper shared library, that is apparently not going to happen anytime soon in varnish itself, maybe fo... [15:44:47] Operations, Ops-Access-Requests: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2072365 (DarTar) approving this request as @schana's manager, thanks. [15:46:29] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2072374 (elukey) So the plan would be to create two separate branches: 1) 3.X containing the current codebase 2) 4.X containing the vut.c/.h files to leverage the tools l... [15:50:33] !log Rolling restart of restbase production complete : T103124 [15:50:34] T103124: Separate metrics, logs, and monitoring between staging and production - https://phabricator.wikimedia.org/T103124 [15:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:30] (CR) Thcipriani: "Isn't this handled by upstart? https://github.com/wikimedia/operations-puppet/blob/production/modules/keyholder/files/keyholder-proxy.conf" [puppet] - https://gerrit.wikimedia.org/r/259596 (owner: Ottomata) [16:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160229T1600). 
[16:00:04] aude matt_flaschen mafk jzerebecki: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:15] . [16:01:04] I can SWAT today. [16:01:07] * aude waves [16:01:40] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/271971 (owner: Matěj Suchánek) [16:02:20] Here [16:02:21] <_joe_> !log stopping hhvm on mw1050 (depooled) for testing bug T128380 [16:02:22] T128380: Redirect with a question mark '?' in the title treats everything following it as URL query part when updating the URL - https://phabricator.wikimedia.org/T128380 [16:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:32] (Merged) jenkins-bot: Update Wikidata property blacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/271971 (owner: Matěj Suchánek) [16:04:20] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/273836 (https://phabricator.wikimedia.org/T128249) (owner: Catrope) [16:04:21] PROBLEM - HHVM rendering on mw1050 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [16:04:52] (Merged) jenkins-bot: Temporarily disable thank-you-edit notifications [mediawiki-config] - https://gerrit.wikimedia.org/r/273836 (https://phabricator.wikimedia.org/T128249) (owner: Catrope) [16:05:12] PROBLEM - Apache HTTP on mw1050 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.012 second response time [16:05:50] PROBLEM - HHVM processes on mw1050 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [16:06:01] !log thcipriani@tin Synchronized wmf-config/Wikibase.php: SWAT: Update Wikidata property blacklist [[gerrit:271971]] (duration: 01m 01s) [16:06:03] ^ aude check please [16:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:08] ok [16:06:30] think 
it's ok [16:07:05] aude: cool, thanks. [16:07:12] thanks :) [16:09:25] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Temporarily disable thank-you-edit notifications [[gerrit:273836]] (duration: 00m 41s) [16:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:32] ^ matt_flaschen check please [16:10:01] (PS1) Filippo Giunchedi: set SO_REUSEADDR before bind() [software/statsdlb] - https://gerrit.wikimedia.org/r/273920 (https://phabricator.wikimedia.org/T126447) [16:10:41] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 500 bytes in 1.144 second response time [16:10:59] maf...hmm...no mafk. [16:11:11] RECOVERY - HHVM processes on mw1050 is OK: PROCS OK: 6 processes with command name hhvm [16:11:32] RECOVERY - HHVM rendering on mw1050 is OK: HTTP OK: HTTP/1.1 200 OK - 71696 bytes in 0.233 second response time [16:12:37] no jouncebot? [16:12:51] oh, just no ping for me... [16:12:51] thcipriani, now works on MediaWiki.org, thanks. [16:13:03] ok... [16:13:27] Operations, Analytics, ArchCom-RfC, Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2072426 (Ottomata) We will resolve this after T120212 is closed, and after we have the first consumer (change propagation) in production. [16:13:36] matt_flaschen: thanks for checking [16:14:09] Krenair: looks like your name is missing from this SWAT window. Same with Monday March 7th. [16:17:14] thcipriani, wait, you didn't deploy the move fix yet? I tested that (I thought you meant that was ready to test too), but I don't see it above. [16:17:45] matt_flaschen: not yet. Still waiting on jenkins for that one. Sorry I wasn't clear. [16:18:23] I'm just wondering why it worked... 
[16:23:56] !log thcipriani@tin Synchronized php-1.27.0-wmf.14/includes/MovePage.php: SWAT: Add TitleMoveStarting, mirroring TitleMoveCompleting [[gerrit:273535]] (duration: 00m 41s) [16:23:59] ^ matt_flaschen titlemovestarting hook syncd, still waiting on jenkins for flow change [16:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:52] jzerebecki: looks like a sync-dir will suffice for the wikidata change, anything else needed? [16:25:54] thcipriani: no that is sufficient [16:28:19] !log thcipriani@tin Synchronized php-1.27.0-wmf.14/extensions/Wikidata: SWAT: Update Wikibase: Fix over-encoding of expanded URLs [[gerrit:273917]] (duration: 02m 04s) [16:28:22] ^ jzerebecki check please [16:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:29:09] thcipriani: works. thx. [16:29:14] looks good to me [16:29:17] jzerebecki: cool, thanks for checking [16:32:23] !log thcipriani@tin Synchronized php-1.27.0-wmf.14/extensions/Flow: SWAT: Fix board move DB issue using new hook TitleMoveStarting [[gerrit:273536]] (duration: 01m 00s) [16:32:25] ^ matt_flaschen flow change sync'd, check please [16:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:11] thcipriani, checking now. [16:33:37] (CR) Luke081515: "There is still the problem, that renaming a global group would deactivate these policys. 
Do you think that the chance, that someone rename" [mediawiki-config] - https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: CSteipp) [16:34:02] (PS1) Filippo Giunchedi: statsdlb: add three statsite instances [puppet] - https://gerrit.wikimedia.org/r/273927 (https://phabricator.wikimedia.org/T105679) [16:34:51] PROBLEM - salt-minion processes on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:34:57] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/273580 (owner: MarcoAurelio) [16:35:04] Operations, ops-ulsfo: ulsfo UL Nagios "host DOWN" for pdua-122/pdua-123 - https://phabricator.wikimedia.org/T128383#2072473 (faidon) [16:35:12] oh, I forgot the swat! [16:35:15] Operations, ops-ulsfo: ulsfo UL Nagios "host DOWN" for pdua-122/pdua-123 - https://phabricator.wikimedia.org/T128383#2072487 (faidon) p:Triage>High [16:35:16] I'm here [16:35:37] thcipriani, (still) works... [16:35:44] (Merged) jenkins-bot: Maintenance on throttle.php [mediawiki-config] - https://gerrit.wikimedia.org/r/273580 (owner: MarcoAurelio) [16:35:45] thcipriani: sorry for being late, I was at work [16:35:49] matt_flaschen: oh good :) [16:36:08] mafk: swat isn't finished though ;) [16:36:10] y [16:36:11] mafk: no problem [16:36:21] good, /me breathes [16:36:36] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2072492 (Ottomata) Sounds good. Let’s make a new 3.x branch now, and make master be for 4.x support. 
[16:38:18] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: Maintenance on throttle.php [[gerrit:273580]] (duration: 00m 41s) [16:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:33] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2072493 (Ottomata) Would it be helpful to build a varnish utils .deb package instead of directly adding sources? If so, I can help with that. [16:38:34] ^ mafk throttle changes sync'd, thanks for the patch! [16:38:36] thcipriani, I think I figured out why it worked on that wiki before the fix. It wouldn't on officewiki, though. Need to go test there. [16:38:52] thcipriani: my pleasure, hope it works [16:39:03] I think I had a couple more in the queue iirc [16:40:14] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/272992 (https://phabricator.wikimedia.org/T127593) (owner: MarcoAurelio) [16:40:50] (Merged) jenkins-bot: Set transwiki import sources for hi.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/272992 (https://phabricator.wikimedia.org/T127593) (owner: MarcoAurelio) [16:41:01] * mafk tests [16:42:15] needs sync. first [16:43:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Set transwiki import sources for hi.wikiquote [[gerrit:272992]] (duration: 00m 41s) [16:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:40] ^ mafk check please [16:44:10] thcipriani: MarcoAurelio (talk | contribs | block) changed group membership for MarcoAurelio@hiwikiquote from (none) to transwiki importer <-- testing [16:44:37] form looks ok [16:44:41] and sources are there too [16:44:49] shall I do a test import? 
[16:45:12] sure [16:47:07] thcipriani: https://hi.wikiquote.org/w/index.php?title=%E0%A4%B8%E0%A4%A6%E0%A4%B8%E0%A5%8D%E0%A4%AF:MarcoAurelio/Mobile_Gateway/Mobile_homepage_formatting&action=history [16:47:12] it works [16:47:27] mafk: awesome! Thanks for testing, and for the patch. [16:47:32] :) [16:47:38] and to you for swat [16:51:44] (CR) Alex Monk: Move horizon apache config into a vhost (1 comment) [puppet] - https://gerrit.wikimedia.org/r/270753 (owner: Andrew Bogott) [16:52:57] Operations, ops-codfw, Labs: Figure out what labstore hardware is viable in codfw - https://phabricator.wikimedia.org/T128083#2072530 (Papaul) labstore2002 is plugged into the switch but there is no activity light on the switch port ge-1/0/1 labstore2003 only has production DNS no mgmt DNS labstore2004... [16:53:07] Operations, ops-ulsfo: ulsfo UL Nagios "host DOWN" for pdua-122/pdua-123 - https://phabricator.wikimedia.org/T128383#2072531 (RobH) There isn't an actual loss of power, as both cr1 and cr2-ulsfo.wikimedia.org show power for both of their power supply units. Each PSU is connected to one of the two towers,... [16:54:51] RECOVERY - salt-minion processes on lvs3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:01:23] Operations, Project-Admins, DevRel-February-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1840761 (RobH) Please don't have any herald rules add any kind of ops site specific projects to #procurement tasks. For now, each procurement task can have th... [17:02:18] (PS1) Muehlenhoff: Regenerate rules/control files after configuration changes [debs/linux44] - https://gerrit.wikimedia.org/r/273934 [17:03:03] Operations, ops-eqiad, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2072589 (liangent) Thanks. It will not be a reimport but a move, so I'll have to take care of host names every time... 
There's no way to ensure reliability I think, unless you have prev.commonswiki.labsd... [17:10:06] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2072646 (Dzahn) [17:13:00] Operations, Analytics, Traffic: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374#2072088 (Milimetric) p:Triage>Normal [17:13:15] (PS1) MarcoAurelio: Enabling ShortURL for bnwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/273936 (https://phabricator.wikimedia.org/T127968) [17:14:11] (PS5) Yuvipanda: tools: Add authentication for docker registry [puppet] - https://gerrit.wikimedia.org/r/273840 (https://phabricator.wikimedia.org/T118758) [17:14:19] (CR) Yuvipanda: [C: 2 V: 2] tools: Add authentication for docker registry [puppet] - https://gerrit.wikimedia.org/r/273840 (https://phabricator.wikimedia.org/T118758) (owner: Yuvipanda) [17:16:33] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2072680 (ssastry) @tstarling and @aaron might b... [17:24:58] (CR) Andrew Bogott: "(followup -- I didn't actually revert this, just fixed the typo)" [puppet] - https://gerrit.wikimedia.org/r/273314 (https://phabricator.wikimedia.org/T124680) (owner: Andrew Bogott) [17:26:59] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2072737 (ssastry) >>! In T124356#2072680, @ssas... 
[17:30:54] Operations: the centralauth databases is accessible form the mysql shell on terbium only in some cases - https://phabricator.wikimedia.org/T122475#2072763 (Milimetric) [17:36:01] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 57.69% of data above the critical threshold [5000000.0] [17:43:30] "Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99)." -- phab busted? [17:43:38] works now [17:45:01] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2072819 (ssastry) So, the extension hooks were... [17:46:04] any other reports? [17:46:34] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2072828 (ssastry) @legoktm .. I wonder if the '... [17:50:41] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:56:14] Operations, Ops-Access-Requests, Patch-For-Review: Access Request for mobrovac as ci-admin to mess with CI infrastructure - https://phabricator.wikimedia.org/T128175#2066176 (RobH) In the ops meeting discussion with Marko; further details and justification are requested for on task discussion. Potent... 
[17:57:27] (PS1) Krinkle: wmfstatic: Update statsd documentation comment [mediawiki-config] - https://gerrit.wikimedia.org/r/273945 [17:57:29] (PS1) Krinkle: wmfstatic: Give requests with unsupported query strings long cache age [mediawiki-config] - https://gerrit.wikimedia.org/r/273946 [17:59:16] (PS2) Krinkle: wmfstatic: Give requests with unsupported query strings long cache age [mediawiki-config] - https://gerrit.wikimedia.org/r/273946 [18:01:34] !log elastic2002.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [18:01:36] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [18:01:36] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [18:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:56] Operations, MobileFrontend, Traffic, MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2072881 (Danny_B) For the record: not all wikis... 
[18:02:48] (CR) Krinkle: [C: 2] wmfstatic: Update statsd documentation comment [mediawiki-config] - https://gerrit.wikimedia.org/r/273945 (owner: Krinkle) [18:02:57] (CR) Krinkle: [C: 2] wmfstatic: Give requests with unsupported query strings long cache age [mediawiki-config] - https://gerrit.wikimedia.org/r/273946 (owner: Krinkle) [18:03:15] (Merged) jenkins-bot: wmfstatic: Update statsd documentation comment [mediawiki-config] - https://gerrit.wikimedia.org/r/273945 (owner: Krinkle) [18:03:26] (Merged) jenkins-bot: wmfstatic: Give requests with unsupported query strings long cache age [mediawiki-config] - https://gerrit.wikimedia.org/r/273946 (owner: Krinkle) [18:05:25] !log krinkle@tin Synchronized w/static.php: (no message) (duration: 01m 01s) [18:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:30] ori: Hm.. varnishtop documents -m but then errors with "-m is not supported" [18:07:24] Operations, MediaWiki-Uploading, Multimedia, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2071563 (BBlack) We should confirm whether this is purely timeout related (due to slow upload speeds), or something else, I guess by trying a large file from... [18:07:42] Operations, MediaWiki-Uploading, Multimedia, Traffic, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2072890 (BBlack) [18:08:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [18:10:57] Operations, Labs, Tool-Labs, Traffic, HTTPS: Fix all http-only tools in tools.wmflabs.org - https://phabricator.wikimedia.org/T102457#1364862 (Krinkle) >>! In T102457#1391885, @yuvipanda wrote: > I can also help the maintainer of stats.grok.de setup SSL, if so desired. Or maybe Magnus' tool sho... 
[18:15:24] 6Operations, 10MobileFrontend, 10Traffic, 5MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2072907 (10ssastry) @Danny_B, FlaggedRevs is just... [18:19:18] @seen hashar [18:19:18] mutante: Last time I saw hashar they were quitting the network with reason: Quit: Textual IRC Client: www.textualapp.com N/A at 2/29/2016 5:04:03 PM (1h15m15s ago) [18:20:51] ema: this one https://phabricator.wikimedia.org/T128381 is just like the one you merged for me last week [18:21:23] Krinkle: what is the full command-line? [18:21:30] -m is not compatible with some other switches, IIRC [18:21:34] $ varnishtop -m 'RxURL:^/w/(skins|extensions|resources)' [18:21:38] nothing else [18:21:54] cp1055:~$ varnishtop -m 'RxURL:^/w/(skins|extensions|resources)' [18:21:54] -m is not supported [18:22:22] the man page doesn't show '-m' [18:22:26] but varnishtop --help does [18:22:41] i suspect it's because multiple varnishlog utilities share a '--help' implementation [18:22:47] because they share the same command-line options [18:23:01] or mostly-share [18:23:17] _joe_: wow, i'm now reading "Now I just touch an empty file in modules/role/manifests and MAGIC! the catalog compilation fails:" haha [18:23:22] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:23:23] ori: so for https://phabricator.wikimedia.org/T123711 we should re-image mc100[123] as last step. I guess that today we could try to pool/de-pool mc1001 as test? 
[18:23:34] <_joe_> mutante: told you it was a good read [18:23:42] Krinkle: yes, that's exactly it: the line in varnishtop's source code is: fprintf(stderr, "usage: varnishtop %s [-1fV] [-n varnish_name]\n", VSL_USAGE); [18:23:48] VSL_USAGE is shared by all varnishlog utils [18:23:56] that's why the '-m' case is explicitly handled [18:24:10] ori: my understanding is that wmf-config/filebackend-production.php should be changed to remove the IP (10.64.0.180) [18:24:17] <_joe_> I secretly hope that "Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago." catches the Puppet Labs folks' attention [18:24:38] elukey: yes, that's correct [18:25:23] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2072930 (10fgiunchedi) [18:25:46] elukey: by the way, it would be nice to change those to hostnames, since hhvm caches dns lookups for 5 mins [18:25:56] so having the ip instead of the hostname isn't much of an optimization [18:26:25] 6Operations, 10Dumps-Generation, 13Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2072937 (10ArielGlenn) rsync between dataset1001 and ms1001 is in progress. I'll rerun it tomorrow am and then proceed on the upgrade. [18:26:54] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2064139 (10fgiunchedi) thanks @Cmjohnson ! I could successfully install restbase1010 after fixing partman in https://gerrit.wikimedia.org/r/273912 though restbase1011 doesn't seem to be... [18:27:04] ori: ok I have two questions: 1) what kind of HA are mc100[123] using? I didn't see anything in the redis config so I suspect it is in mw? 2) How do we measure that the host is not taking any lock-related traffic after the de-pool? Redis-cli? 
(not super important but it would be nice) Or maybe just logstash [18:27:53] _joe_: i have been doing a mixture of 1) and 2) so far, migrated all the easy second-level ones, and renamed some first-level to second-level, re: bikeshedding, instead of "temp" i did stuff like "etherpad" becomes "etherpad::server" [18:28:11] i will try to get the mediawiki part done next [18:28:20] then let's see what is left to move in one patch [18:28:25] <_joe_> mutante: nope, I prefer to have it with a clear back-migration path [18:28:38] <_joe_> and yes, +1 [18:29:01] to point it out more? like role foo::fixme :) [18:29:14] ok [18:29:31] elukey: 2) yes, I think redis-cli (with 'monitor') would be ideal [18:30:15] elukey: for 1), the best I can find is : https://github.com/wikimedia/mediawiki/blob/master/includes/filebackend/lockmanager/RedisLockManager.php#L27-L33 [18:31:27] (03PS1) 10Mattflaschen: Fix ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273948 [18:31:34] elukey: specifically: "All lock requests for a resource, identified by a hash string, will map to one bucket. Each bucket maps to one or several peer servers, each running redis. A majority of peers must agree for a lock to be acquired." [18:31:41] (03PS1) 10Elukey: Remove mc1001.eqiad from the list of lock managers for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273949 (https://phabricator.wikimedia.org/T123711) [18:31:50] elukey: it looks like there is only one bucket defined in prod, which includes all three redis servers [18:31:56] (03CR) 10Mattflaschen: [C: 04-1] "Needs to be regenerated after ptwikibooks fix." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/272929 (owner: 10Mattflaschen) [18:32:03] so locks will be set in all three, and as long as there are two servers still up, that's enough for a majority [18:32:11] _joe_: fwiw, here's another largish one, CI https://gerrit.wikimedia.org/r/#/c/260939/ laters [18:33:08] (03CR) 10Ori.livneh: [C: 031] Remove mc1001.eqiad from the list of lock managers for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273949 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [18:34:38] (03CR) 10Mattflaschen: [C: 032] "Discussed in team meeting. Solution is clear." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273948 (owner: 10Mattflaschen) [18:35:02] (03Merged) 10jenkins-bot: Fix ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273948 (owner: 10Mattflaschen) [18:35:21] (03PS1) 10Elukey: Add Debian PXE boot option to mc100[123] servers. [puppet] - 10https://gerrit.wikimedia.org/r/273951 (https://phabricator.wikimedia.org/T123711) [18:36:36] ori: I need to merge https://gerrit.wikimedia.org/r/#/c/273951/1 firt, I knew I was missing something. After that I'd need a bit of help in deploying the media-wiki config since I haven't done it yet :) [18:36:52] mutante: Hi! Would you mind to review https://gerrit.wikimedia.org/r/#/c/273951/1 to be super sure? [18:37:26] elukey: ok, looking [18:37:28] elukey: sure [18:38:50] (03CR) 10Dzahn: [C: 031] Add Debian PXE boot option to mc100[123] servers. [puppet] - 10https://gerrit.wikimedia.org/r/273951 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [18:39:15] (03CR) 10Dzahn: [V: 031] Add Debian PXE boot option to mc100[123] servers. [puppet] - 10https://gerrit.wikimedia.org/r/273951 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [18:39:24] thanks mutante! [18:39:31] sure, yw [18:39:48] (03CR) 10Elukey: [C: 032] Add Debian PXE boot option to mc100[123] servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/273951 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [18:40:32] !log mattflaschen@tin Synchronized dblists/nonflow.dblist: Re-enable Flow on ptwikibooks. Accidentally disabled earlier. (duration: 00m 47s) [18:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:53] ori: all right, done, I am going to check your suggestion for 2) then we'll proceed with the deployment [18:42:58] ah ok it enables tracing, starightforward enough :) [18:44:41] ori: do we want to use mw1017 first or just directly to https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Small_changes:_sync_individual_files ? [18:46:34] elukey: just directly [18:46:46] elukey: I have the following command for verifying it is no longer used: [18:46:56] redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" monitor | grep RedisLockManager [18:47:17] the whole "$(...)" is just a fancy way of referencing the password without quoting it on irc [18:49:14] and if you were wondering if there's a way to avoid having the password shown in "ps" the answer is "not yet" https://github.com/antirez/redis/issues/2194 [18:50:21] (03PS2) 10Dzahn: Move access_new_install role to neodymium [puppet] - 10https://gerrit.wikimedia.org/r/273913 (owner: 10Muehlenhoff) [18:50:31] (03CR) 10Dzahn: [C: 032] Move access_new_install role to neodymium [puppet] - 10https://gerrit.wikimedia.org/r/273913 (owner: 10Muehlenhoff) [18:51:01] godog: :) [18:52:08] (03CR) 10Aaron Schulz: [C: 031] Add references to wmfServices for Cirrusearch. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [18:53:05] ori: :D btw I'll probably need to do the same ::instance thing for statsdlb as we have for statsite, unless there's a better/newer pattern for multiple instances [18:53:51] (03PS1) 10Elukey: Remove mc1001 from the redis/memcached pools for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/273952 (https://phabricator.wikimedia.org/T123711) [18:54:04] godog: are you opposed to trying out a faster hash function? [18:54:18] is there a chance that it would improve performance sufficiently that multiple instances would not be required? [18:55:25] ori: could be! I'm not opposed but is all the cpu spent in hashing? also even with faster hashing it is still limited to one core [18:55:40] (03PS2) 10Elukey: Remove mc1001.eqiad from the list of lock managers for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273949 (https://phabricator.wikimedia.org/T123711) [18:56:31] ori: I'm good with either btw, including a golang implementation [18:56:40] (03CR) 10Dzahn: [Planet Wikimedia] Add Albanian and Bulgarian planets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [18:56:45] I have to go tho, ttyl [18:56:55] godog: ttyl [18:57:24] (03CR) 10Elukey: [C: 032] Remove mc1001.eqiad from the list of lock managers for maintenance. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/273949 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [18:57:47] ori: change merged, proceeding with deploy --^ [18:57:54] elukey: +1 [18:59:01] (03CR) 10Nemo bis: [Planet Wikimedia] Add Albanian and Bulgarian planets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:01:13] (03CR) 10Dzahn: [Planet Wikimedia] Add Albanian and Bulgarian planets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:01:21] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [19:03:31] ori: just to be super sure - I did a git fetch and checked git log HEAD..origin from elukey@tin:/srv/mediawiki-staging/wmf-config$ that my commit is the only one in the diff. Now I should do git pull and then sync-file to complete right? [19:04:30] ori: Hm.. okay. So how would you do this instead? Or just collect a bit and then aggregate manually [19:04:59] elukey: 'git merge origin/master' should be sufficient [19:05:00] (03CR) 10Dzahn: [Planet Wikimedia] Add Albanian and Bulgarian planets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:05:36] ori: thanks, proceeding [19:06:58] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2073111 (10RobH) Both sub-tasks have been escalated to @faidon for his review. [19:07:37] elukey: ori: (sorry just lurking) - should it stay in 'srvsByBucket' ? [19:07:46] 6Operations, 10MobileFrontend, 10Traffic, 5MW-1.27-release, and 5 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2073125 (10ssastry) >>! In T124356#2072907, @ssas... 
[19:07:52] !log elukey@tin Synchronized wmf-config/filebackend-production.php: Remove mc1001 from the lock managers for maintenance (duration: 00m 41s) [19:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:16] Krinkle: why? the two remaining servers are enough to form a majority; what would be the benefit of having mediawiki try (and fail) to get an answer from mc1001? [19:08:23] Krinkle: sorry I didn't get the srvsByBucket reference :( [19:08:28] ori: Yes... [19:08:48] It is however still in srvsByBucket [19:08:49] Krinkle, ori: change pushed [19:08:51] I don't know this code [19:08:55] Just seemed odd [19:09:21] checking with redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" monitor | grep RedisLockManager as agreed [19:09:44] Notice: Undefined index: rdb1 in /srv/mediawiki/php-1.27.0-wmf.14/includes/filebackend/lockmanager/RedisLockManager.php on line 245 [19:09:52] aaaaand that's what I thought [19:10:09] Not sure how this falls back, should be fine? [19:10:28] * Krinkle goes back to Gerrit review [19:10:35] oh, I see what you're saying [19:10:37] yeah, we missed that. 
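Note on the "Undefined index: rdb1" notice above: the quoted RedisLockManager description says each bucket maps to peer servers and a majority of peers must agree, so a peer name left in a bucket after its server entry is removed is the kind of mismatch a small check could catch. A minimal sketch, assuming a config shaped as a name-to-address map plus a bucket-to-names map; the key layout, the names rdb2 and rdb3, and the addresses are illustrative assumptions, not the production values.

```python
def dangling_peers(lock_servers, srvs_by_bucket):
    """Return {bucket: [names]} for peers named in a bucket but missing from lock_servers."""
    missing = {}
    for bucket, peers in srvs_by_bucket.items():
        unknown = [name for name in peers if name not in lock_servers]
        if unknown:
            missing[bucket] = unknown
    return missing


def has_quorum(peers, reachable):
    """True when a strict majority of the bucket's peers is reachable."""
    up = sum(1 for name in peers if name in reachable)
    return up > len(peers) // 2


# Hypothetical data: rdb1 was removed from the server map but left in the bucket.
lock_servers = {"rdb2": "192.0.2.2", "rdb3": "192.0.2.3"}
srvs_by_bucket = {0: ["rdb1", "rdb2", "rdb3"]}

print(dangling_peers(lock_servers, srvs_by_bucket))
print(has_quorum(srvs_by_bucket[0], {"rdb2", "rdb3"}))
```

Running a check like this before syncing the file would flag the leftover bucket entry, while the quorum helper shows why two of three peers are still enough for locking.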
[19:10:59] elukey: there's a reference to a 'rdb1' in the srvsByBucket key on line 143 [19:11:18] mediawiki syntax for vim https://raw.githubusercontent.com/chikamichi/mediawiki.vim/master/syntax/mediawiki.vim [19:11:30] ori: ah snap I only checked the IP [19:11:37] (03PS1) 10Ori.livneh: Follow-up for I8a87a679ed: unlist rdb1 server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273957 [19:11:56] (03CR) 10Ori.livneh: [C: 032] Follow-up for I8a87a679ed: unlist rdb1 server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273957 (owner: 10Ori.livneh) [19:12:31] (03Merged) 10jenkins-bot: Follow-up for I8a87a679ed: unlist rdb1 server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273957 (owner: 10Ori.livneh) [19:12:31] Krinkle: now I get what you were saying, totally missed it, sorry [19:12:47] (03PS3) 10Dzahn: [Planet Wikimedia] Multiple additions to English, Spanish, Ukrainian [puppet] - 10https://gerrit.wikimedia.org/r/273777 (owner: 10Nemo bis) [19:12:58] thanks Krinkle [19:13:00] (03CR) 10Dzahn: [C: 032] [Planet Wikimedia] Multiple additions to English, Spanish, Ukrainian [puppet] - 10https://gerrit.wikimedia.org/r/273777 (owner: 10Nemo bis) [19:13:09] ori: yw. [19:13:15] This structure should be better designed so this isn't possible [19:13:36] !log ori@tin Synchronized wmf-config/filebackend-production.php: I0ad3e23719c: Follow-up for I8a87a679ed: unlist rdb1 server (duration: 00m 44s) [19:13:37] I agree; the same issue bit me once with the job queue config [19:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:44] elukey: sorry for not catching that [19:15:04] ori: nah I totally missed it, I should have thought about it. 
I'll add notes in the https://wikitech.wikimedia.org/wiki/Service_restarts docs for pool/de-pool once done [19:19:56] ori: proceeding with https://gerrit.wikimedia.org/r/#/c/273952/1 because it takes usually ~20/30 mins to get it rolled out everywhere [19:20:13] elukey: nod [19:20:13] (03PS2) 10Elukey: Remove mc1001 from the redis/memcached pools for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/273952 (https://phabricator.wikimedia.org/T123711) [19:21:58] 6Operations, 10DBA: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2073194 (10jcrespo) a:3jcrespo [19:22:20] !log elastic2003.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [19:22:22] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [19:22:22] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [19:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:25] (03CR) 10Elukey: [C: 032] Remove mc1001 from the redis/memcached pools for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/273952 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [19:22:33] 6Operations, 10DBA: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#1972513 (10jcrespo) p:5Triage>3Normal [19:23:19] (03PS1) 10Jcrespo: [WIP]Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [19:23:35] (03PS2) 10Jcrespo: [WIP]Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [19:24:27] ^worse puppet ever [19:25:36] !log removed mc1001.equiad from the redis/memcached pools for maintenance [19:25:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:25:52] (03PS2) 10Dzahn: [Planet Wikimedia] Add Albanian and Bulgarian planets [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:26:30] any idea why after puppet-merge strontium said fatal: empty ident puppet-lint in jenkins didn't complain [19:27:20] puppet merge on palladium? [19:27:21] elukey: the error you get from the puppet-merge script ? [19:27:26] repeat it one more time [19:27:30] is it gone then? [19:27:46] yep.. no changes to merge [19:28:21] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:36] it is weird because I changed only yaml files.. mmm [19:28:42] jynus: yes [19:29:00] it happens sometimes and isn't new but i cant explain why [19:29:23] if it's the same error [19:29:25] mutante: thanks for the support :) I am seeing my change beeing propagated because of the erros in logstash [19:29:30] (expected) [19:29:40] :) [19:29:56] it must be some kind of race with the sync script [19:31:22] ori: all right change pushed, waiting for puppet to do its job during the next 25 minutes more or less. After that I'll re-image the host and put it back into service [19:32:56] (03PS3) 10Dzahn: [Planet Wikimedia] Add Albanian and Bulgarian planets [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:33:12] (03CR) 10Dzahn: [C: 032] [Planet Wikimedia] Add Albanian and Bulgarian planets [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:34:23] (03CR) 10Dzahn: "PS3: converted UTF-8 to HTML entities because .. 
puppet breaks on the literal UTF-8 chars in manifests" [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:35:23] (03CR) 10Ottomata: Replace limn::data::generate by reportupdater (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [19:36:18] 6Operations, 10Ops-Access-Requests: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2073231 (10Dzahn) @ema or the person on duty this week, this is just like https://gerrit.wikimedia.org/r/#/c/273038/ [19:37:22] (03CR) 10Dzahn: "Apache config, planet config, cron jobs added.. gotta add to DNS too" [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:39:44] (03PS1) 10Dzahn: add Albanian and Bulgarian planet [dns] - 10https://gerrit.wikimedia.org/r/273960 [19:40:40] (03PS3) 10Hashar: nodepool 0.1.1-wmf4 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/237700 (https://phabricator.wikimedia.org/T111377) [19:40:47] (03PS2) 10Dzahn: add Albanian and Bulgarian planet [dns] - 10https://gerrit.wikimedia.org/r/273960 [19:40:56] (03CR) 10Dzahn: [C: 032] add Albanian and Bulgarian planet [dns] - 10https://gerrit.wikimedia.org/r/273960 (owner: 10Dzahn) [19:41:14] (03CR) 10CSteipp: "Luke081515, I think the risk is pretty low, but I haven't looked at how frequently those are changed. 
I can try to see when the last time " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [19:45:06] !log new planets, Albanian and Bulgarian: https://sq.planet.wikimedia.org/ | https://bg.planet.wikimedia.org/ [19:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:58] (03CR) 10Dzahn: "added to DNS, ran update scripts manually to not wait for cron" [puppet] - 10https://gerrit.wikimedia.org/r/273830 (owner: 10Nemo bis) [19:49:56] mutante: related to what you are doing?: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: invalid byte sequence in US-ASCII at /etc/puppet/manifests/role/planet.pp:1 on node wikidev16videos.wikidata-dev.eqiad.wmflabs [19:51:48] jzerebecki: probably, yes. interesting though because that is fine in production and is related to puppet version :/ [19:52:03] jzerebecki: hold on .. [19:52:32] that is precisely why i use HTML entities [19:52:54] damn UTF-8 puppet issues [19:53:08] well, and here US-ASCII [19:56:27] mutante: the ë [19:57:03] in Albanian, yep [19:57:14] fixing [19:58:31] the files wasn't even valid UTF-8... 
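Note on the "invalid byte sequence in US-ASCII" error above: the workaround described in the review comments was to replace literal non-ASCII characters with HTML entities. A minimal sketch of both halves of that diagnosis, assuming numeric character references are acceptable (the actual patch may have used named entities) and using a made-up sample word containing "ë".

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def to_html_entities(text):
    """Replace every non-ASCII character with a numeric HTML character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical sample: the same word stored as Latin-1 bytes is not valid UTF-8.
sample = "Shqip\u00ebri"
print(first_invalid_utf8(sample.encode("latin-1")))
print(first_invalid_utf8(sample.encode("utf-8")))
print(to_html_entities(sample))
```

The first helper covers the "wasn't even valid UTF-8" case; the second produces a pure-ASCII string that a parser running in a US-ASCII locale can read.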
[19:59:08] (03PS1) 10Dzahn: planet: fix 'invalid byte sequence' issue in sq config [puppet] - 10https://gerrit.wikimedia.org/r/273964 [19:59:45] (03PS2) 10Dzahn: planet: fix 'invalid byte sequence' issue in sq config [puppet] - 10https://gerrit.wikimedia.org/r/273964 [20:00:16] (03CR) 10Dzahn: [C: 032] "like back in https://gerrit.wikimedia.org/r/#/c/194214/ but with US-ASCII umlaut brokenness" [puppet] - 10https://gerrit.wikimedia.org/r/273964 (owner: 10Dzahn) [20:00:53] (03CR) 10Dzahn: "11:54 < jzerebecki> mutante: related to what you are doing?: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Co" [puppet] - 10https://gerrit.wikimedia.org/r/273964 (owner: 10Dzahn) [20:02:02] jzerebecki: also https://phabricator.wikimedia.org/T91453 :/ is it gone ? [20:02:50] jzerebecki: also, does not happen on jessie puppet [20:03:24] 6Operations, 10Continuous-Integration-Infrastructure, 13Patch-For-Review: invalid byte sequence in US-ASCII - puppet issues with UTF-8 - https://phabricator.wikimedia.org/T91453#2073355 (10Dzahn) one more https://gerrit.wikimedia.org/r/273964 [20:03:28] mutante: works now. I don't have access to the master... [20:04:55] !log Restarted statsdlb on graphite1001 with xxh64() improvement [20:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:07:57] ori: I added a variation of your command in https://wikitech.wikimedia.org/wiki/Service_restarts#Redis to check traffic drain on a redis instance. [20:08:10] elukey: thanks! [20:08:52] I will also add all the procedure for Lock Managers [20:09:00] those are painful ones [20:18:30] <_joe_> elukey: why painful? [20:19:43] <_joe_> (also, netstat -tunap | grep redis-server gives you an even better view of actual connections to redis, IMO) [20:19:48] _joe_ hello! Well two code reviews, mediawiki-deployment, etc.. 
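Note on the drain check quoted earlier (grep -Po '(?<=masterauth ).*' piped into redis-cli monitor and grep): the lookbehind prints only what follows "masterauth " on the line, and the final grep keeps only monitor lines that mention the lock manager. A rough Python equivalent of those two text steps, as a sketch; the config text and monitor lines below are invented samples, and whether real monitor output contains the string "RedisLockManager" is taken from the chat, not verified here.

```python
import re

# Fixed-width lookbehind: match starts right after "masterauth " and runs to end of line.
MASTERAUTH = re.compile(r"(?<=masterauth ).*")


def masterauth_value(config_text):
    """Return the text following the first 'masterauth ' in config_text, or None."""
    match = MASTERAUTH.search(config_text)
    return match.group(0) if match else None


def count_matching(lines, needle="RedisLockManager"):
    """Count lines that contain needle, like piping through grep and counting the output."""
    return sum(1 for line in lines if needle in line)


# Hypothetical inputs for illustration only.
config_text = "port 6379\nmasterauth example-secret\n"
monitor_lines = [
    '1456772961.000001 [0 198.51.100.7:50000] "EVAL" "-- RedisLockManager ..."',
    '1456772961.000002 [0 198.51.100.8:50001] "GET" "some:key"',
]

print(masterauth_value(config_text))
print(count_matching(monitor_lines))
```

A count that stays at zero over a reasonable observation window is the signal that the host no longer receives lock traffic.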
[20:20:19] yes you are probably right, I'll add it to the docs too [20:21:34] !log elastic2004.codfw.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [20:21:36] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [20:21:36] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [20:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:28] (03PS1) 10Elukey: Fixed partman config for mcXXXX hosts. [puppet] - 10https://gerrit.wikimedia.org/r/273967 [20:29:41] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:56] (03CR) 10Dzahn: [C: 032 V: 032] Fixed partman config for mcXXXX hosts. [puppet] - 10https://gerrit.wikimedia.org/r/273967 (owner: 10Elukey) [20:32:07] 6Operations, 10Wikimedia-Mailing-lists: Fwd: 7 Fd-advisorygroup moderator request(s) waiting - https://phabricator.wikimedia.org/T128406#2073470 (10Dzahn) [20:41:33] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Fix all http-only tools in tools.wmflabs.org - https://phabricator.wikimedia.org/T102457#2073490 (10Nemo_bis) >>! In T102457#2072902, @Krinkle wrote: > Or maybe Magnus' tool should use the new [Page View API](https://wikitech.wikimedia.org/wiki/Analytic... [20:42:43] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Fix all http-only tools in tools.wmflabs.org - https://phabricator.wikimedia.org/T102457#1364862 (10Dzahn) >>! In T102457#2073490, @Nemo_bis wrote: > This bug is fixed as regards Magnus tools, I think; and it's probably too generic for the whole of Tool... [20:49:57] _joe_ still there? 
[20:50:40] I am adding shard1 to the mc1001 host in the memcached pool config in hiera [20:50:46] as I did with the other ones [20:50:50] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Fix all http-only tools in tools.wmflabs.org - https://phabricator.wikimedia.org/T102457#2073559 (10Nemo_bis) >>! In T102457#2073493, @Dzahn wrote: > Use as a tracking ticket. Make subtasks for individual tools who are still http-only.? We need a list... [20:51:16] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, and 2 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2073569 (10Nemo_bis) [20:51:20] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Make Magnus tools on tools.wmflabs.org work in HTTPS - https://phabricator.wikimedia.org/T102457#2073566 (10Nemo_bis) 5Open>3Resolved a:3Magnus [20:51:31] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Make Magnus tools on tools.wmflabs.org work in HTTPS - https://phabricator.wikimedia.org/T102457#1364862 (10Nemo_bis) [20:53:18] (03PS1) 10Elukey: Add mc1001.eqiad back to the redis/memcached pool after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/273969 (https://phabricator.wikimedia.org/T123711) [20:55:33] (03CR) 10Elukey: [C: 032] Add mc1001.eqiad back to the redis/memcached pool after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/273969 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [20:56:42] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:49] !log Added mc1001.eqiad back into memcached/redis pools after maintenance [20:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:39] ori: still there/ [20:57:41] ? 
[20:59:27] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073607 (10Ricordisamoa) [20:59:54] 6Operations, 10RESTBase, 6Services, 10Traffic, 3Mobile-Content-Service: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2073608 (10BBlack) Status update: I'm refactoring the above patch a bit to try to eliminate the code duplication, shou... [21:00:05] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160229T2100). [21:01:58] (03PS1) 10Elukey: Add mc1001.eqiad back to the Redis Lock Managers after maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273970 (https://phabricator.wikimedia.org/T123711) [21:03:56] I'd need to deploy mediawiki-config after mc1001 maintenance, is there anybody willing to review ---^ [21:16:36] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073537 (10scfc) I don't think it is feasible to detect those tools this way because there would be way too many code paths that are only triggered under spec... [21:20:45] elukey: hi [21:20:47] i'll review [21:21:06] ori: o/ [21:21:12] (03CR) 10Ori.livneh: [C: 031] Add mc1001.eqiad back to the Redis Lock Managers after maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273970 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [21:21:14] lgtm [21:21:24] all right, merging [21:22:19] (03CR) 10Elukey: [C: 032] Add mc1001.eqiad back to the Redis Lock Managers after maintenance. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/273970 (https://phabricator.wikimedia.org/T123711) (owner: 10Elukey) [21:23:07] (03CR) 10Ori.livneh: [C: 032] set SO_REUSEADDR before bind() [software/statsdlb] - 10https://gerrit.wikimedia.org/r/273920 (https://phabricator.wikimedia.org/T126447) (owner: 10Filippo Giunchedi) [21:24:11] (03PS1) 10GWicke: Add purged_cache_control config variable [puppet] - 10https://gerrit.wikimedia.org/r/273974 [21:24:17] (03CR) 10Ori.livneh: [V: 032] set SO_REUSEADDR before bind() [software/statsdlb] - 10https://gerrit.wikimedia.org/r/273920 (https://phabricator.wikimedia.org/T126447) (owner: 10Filippo Giunchedi) [21:24:40] !log elukey@tin Synchronized wmf-config/filebackend-production.php: Add mc1001 from the lock managers after maintenance (duration: 00m 54s) [21:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:57] ori: all done :) [21:25:09] elukey: awesome, thanks! [21:25:26] elukey: what's the plan for the other ones? [21:25:31] 6Operations, 10DNS, 10Internet-Archive, 10Traffic, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1287289 (10Dzahn) also see T124127 [21:26:00] ori: I was about to ask you.. I'll work on them tomorrow with Giuseppe probably, asking him to review my code.. would it be ok? [21:26:12] elukey: yes, for sure [21:26:43] ori: all right then.. keeping an eye on logstash for a bit then I'll go offline :) [21:26:51] talk with you tomorrow! thanks for the help [21:34:00] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, 7HTTPS: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073765 (10Dzahn) step 0. figure out which instance(s) are "on the proxy" step 1. what is the relevant role class? i think "dynamicproxy" module but no role... 
[21:37:07] !log updated Parsoid to version d809ad7a [21:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:37:20] 6Operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2073771 (10RobH) I'm assigning this task to @Mark for his approval to allocate one of the upcoming 6 restbase spare systems (1001-1006) for this. Once approved, we can move on the sub-... [21:37:30] 6Operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2073772 (10RobH) a:5RobH>3mark [21:37:41] !log upgrade elastic2005.codfw.wmnet to elastic 1.7.5 [21:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:56:09] (03PS1) 10Ori.livneh: Make 'field' / 'fields' shell functions available to everyone [puppet] - 10https://gerrit.wikimedia.org/r/273983 (https://phabricator.wikimedia.org/T102793) [21:56:31] (03PS2) 10Ori.livneh: Make 'field' / 'fields' shell functions available to everyone [puppet] - 10https://gerrit.wikimedia.org/r/273983 (https://phabricator.wikimedia.org/T102793) [21:56:54] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.00 seconds [21:57:43] jynus? [21:57:50] that is the phab host [21:57:57] ohsigh [21:57:58] (03CR) 10Mattflaschen: "There are still a few references:" [puppet] - 10https://gerrit.wikimedia.org/r/232903 (owner: 10Faidon Liambotis) [21:58:59] I think it is updating a repo [22:03:51] number of writes has multiplied by 26 [22:04:01] there is nothing I can do about that [22:06:00] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2073882 (10Ottomata) Bump! 
[22:07:05] (PS2) Andrew Bogott: Move horizon apache config into a vhost [puppet] - https://gerrit.wikimedia.org/r/270753
[22:07:07] (PS1) Andrew Bogott: Add horizon to labtestweb. [puppet] - https://gerrit.wikimedia.org/r/273985
[22:07:30] bd808: kind of a blast from the past I know, but did you ever have any thoughts on T106351?
[22:07:30] T106351: RESTBase dashboard annotations for deployments (and more) - https://phabricator.wikimedia.org/T106351
[22:08:41] bd808: the ticket is ostensibly about restbase, but what i was thinking about was a means for creating arbitrary grafana annotations based on SAL entries
[22:09:09] (more generally)
[22:09:31] well... we have data in elasticsearch in labs -- https://tools.wmflabs.org/sal/
[22:09:34] Operations, Wikimedia-Mailing-lists: Fwd: 7 Fd-advisorygroup moderator request(s) waiting - https://phabricator.wikimedia.org/T128406#2073891 (Dzahn) @Muehlenhoff since you are on duty this week, could you handle this? i added docs here: https://wikitech.wikimedia.org/wiki/Mailman#Reset_the_admin_passwo...
[22:09:51] Operations, Wikimedia-Mailing-lists: Fwd: 7 Fd-advisorygroup moderator request(s) waiting - https://phabricator.wikimedia.org/T128406#2073893 (Dzahn) a:Muehlenhoff
[22:10:54] (CR) Gergő Tisza: "Scheduled for the Tuesday 16h PST SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/271932 (https://phabricator.wikimedia.org/T127509) (owner: MarcoAurelio)
[22:10:57] bd808: maybe grok some meta data from the irc message, and use that for some additional fields?
[22:11:32] bd808: or maybe singular, a field
[22:11:49] something that you could match on with an annotation query
[22:12:29] _annotation:restbase
[22:13:08] timestamp, nick, message and project are recorded right now. project == production for anything logged from this channel
[22:13:37] bd808: but that seems over general
[22:13:42] ostriches: can you un-(-2) https://gerrit.wikimedia.org/r/#/c/271932/ ?
[22:14:29] !log setting innodb_flush_log_at_trx_commit=0 on phabricator db slave
[22:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:14:36] bd808: no? if you wanted to annotate a restbase dashboard for example, with only messages that were apropos to restbase, how would you do that?
[22:15:07] a "message containing X" query?
[22:15:18] urandom: yeah, https://tools.wmflabs.org/sal/production?p=0&q=restbase
[22:15:29] Operations, Ops-Access-Requests: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2073918 (Dzahn) a:Muehlenhoff
[22:15:51] bd808: yeah, that might do it
[22:16:30] bd808: what are the elasticsearch instances for that?
[22:16:41] (PS1) Ori.livneh: Report save timing by MediaWiki version [puppet] - https://gerrit.wikimedia.org/r/273990 (https://phabricator.wikimedia.org/T112557)
[22:16:42] the backing elasticsearch instances aren't visible outside of the tools project so we'd need to build something to proxy. Adding that to the sal tool wouldn't be hard
[22:16:50] Operations, Ops-Access-Requests: Requesting access to researchers for nschaaf - https://phabricator.wikimedia.org/T128381#2072331 (Dzahn) 'researchers' only does the mysql access to the research database. additional groups are needed to get on bastions and/or stat100x host(s).
[22:16:58] (CR) Ori.livneh: [C: 2] Make 'field' / 'fields' shell functions available to everyone [puppet] - https://gerrit.wikimedia.org/r/273983 (https://phabricator.wikimedia.org/T102793) (owner: Ori.livneh)
[22:17:13] urandom: app source is at https://github.com/bd808/sal
[22:17:47] bd808: add something to proxy elastic search queries to this, you mean?
[22:18:09] tgr: Done
[22:18:26] urandom: expose whatever grafana needs, yeah
[22:18:30] * urandom wonders how hard it would be to teach grafana about other annotation sources
[22:18:58] jynus: That was a batch of repos I was importing from gerrit.
One of them had >13k commits :'(
[22:19:10] The queue is almost gone now, writes should go back to normal.
[22:19:18] ostriches, if this was known
[22:19:20] it is ok
[22:19:49] do you have icinga access?
[22:22:04] jynus: No I do not
[22:22:34] I mean I have read, I can't ack things
[22:23:15] why is phabricator pooping itself on every other request?
[22:23:43] it is ok, whenever you are going to do any batched processing just say "this is going to page db1048 (phabricator db), please downtime it for me"
[22:24:10] and someone will help
[22:24:15] is there documentation i can read anywhere about what the networks in codfw and eqiad look like?
[22:24:18] MatmaRex: Was importing a few gerrit repos. One had a crapton of commits and it backlogged.
[22:24:29] jynus: mmk will do
[22:24:47] what happened is not a bad thing per se
[22:25:04] but it is impossible to distinguish from a bad thing
[22:25:08] * ostriches nods
[22:25:10] of course
[22:25:39] we are still on 5.5 on many nodes, so we do not yet have parallel commits enabled
[22:25:44] * apergos will keep a lookout for such requests if they are around
[22:25:45] !log upgrade elastic2006.codfw.wmnet to elasticsearch 1.7.5
[22:25:46] so replication is slow
[22:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:26:08] which was the 13k commit repo anyways?
[22:26:19] enough for daily usage, but not enough for an import
[22:26:56] jynus, ostriches and I looked at External Store on Beta. However, I noticed the script is broken (before running it for once :) ). There is a proposed fix and open questions at https://phabricator.wikimedia.org/T95871 .
[22:27:55] hmpf Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99).
[22:27:59] twentyafterfour: I'm getting "Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99)." for about 20% of my requests to phabricator.wm.o
[22:28:07] Looks like MatmaRex and Nemo_bis are reporting the same thing
[22:28:11] Refresh usually fixes it
[22:28:27] too many open connections perhaps?
[22:28:37] as I mentioned, the database is now under 26x the number of writes
[22:28:43] !log phabricator: stopped phd daemons for the time being so replication can catch up and number of connections stop yelling at people
[22:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:28:58] jynus: Where can I see the replication status for m3?
[22:29:22] https://tendril.wikimedia.org/host/view/db1048.eqiad.wmnet/3306
[22:29:26] ty
[22:29:32] requires NDA
[22:29:55] In particular, check the Query Write Traffic
[22:30:22] Replication Seconds_Behind_Master
[22:30:57] and for its master, https://tendril.wikimedia.org/host/view/db1043.eqiad.wmnet/3306
[22:31:29] Threads_connected and Connection Problems, which caused problems in the past
[22:33:27] PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd)
[22:33:51] there seem to be issues larger than the database
[22:34:06] need help?
[22:34:09] well that's ostriches right?
[22:34:14] ostriches: fyi it paged folks
[22:34:15] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[22:34:17] phd that is
[22:34:20] yeah
[22:34:22] Bleh, that pages?
[22:34:24] yeah
[22:34:27] so it seems
[22:34:28] phd kicks itself all the time.
[22:34:38] You guys really shouldn't get a page when it bounces...
[22:34:43] seems like it maybe doesn't need to really
[22:34:43] well, phabricator is a pretty important service of ours
[22:34:53] maybe if it's down for 30m or something
[22:35:08] Ok, replag is gone
[22:35:28] did you restart it?
[22:35:31] I suppose that means we're about to get a recovery page as you restart the service
[22:35:55] phd should recover now
[22:36:01] did you restart it?
[22:37:05] Yep
[22:37:26] RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 11 processes with UID = 997 (phd)
[22:37:43] and there's the page :-D
[22:38:10] so, remember when I said to please log and ask for help for downtiming the database?
[22:38:23] it could apply to phabricator, too
[22:39:39] FWIW I'm getting "AphrontConnectionQueryException Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #2003: Can't connect to MySQL server on 'm3-master.eqiad.wmnet' (99)."
[22:39:46] See topic ^
[22:40:37] * James_F nods.
[22:41:17] so the phabricator crashing is real?
[22:41:35] Phab blows up when it hits connection limits it seems.
[22:41:49] It'd be nice if we could have phd use a different connection pool from the users.
[22:42:07] jynus: Yes.
[22:42:08] So if phd goes crazy, it'd only start yelling at itself and not unsuspecting users.
[22:42:35] we can limit the connections per user
[22:42:47] Import backlog gone, shouldn't see any more connection errors now
[22:43:02] so ostriches, which repo had the 13k commits? :-D
[22:43:50] https://phabricator.wikimedia.org/diffusion/WFCV/ - 15,872 Commits
[22:44:04] I think phd uses the normal user to connect to db atm, but it may be configurable?
[22:44:04] daaannnggg
[22:44:28] is that a CiviCRM fork?
[22:44:52] fundraising...
[22:44:53] I didn't realize we had a whole deep clone of civicrm in our repo
[22:46:28] chasemp: I bet it does, and I doubt it is.
[22:46:55] (PS1) Ori.livneh: Apply xhgui role on tungsten [puppet] - https://gerrit.wikimedia.org/r/274000
[22:47:09] (CR) Ori.livneh: [C: 2 V: 2] Apply xhgui role on tungsten [puppet] - https://gerrit.wikimedia.org/r/274000 (owner: Ori.livneh)
[22:50:07] Operations, ops-ulsfo: ulsfo UL Nagios "host DOWN" for pdua-122/pdua-123 - https://phabricator.wikimedia.org/T128383#2074027 (RobH) Since this is in a public task, it cannot simply be CC'd like many of our procurement tasks. As such, I'll just paste in the UL support updates as they happen. My outgoin...
[22:52:39] (PS3) Andrew Bogott: Move horizon apache config into a vhost [puppet] - https://gerrit.wikimedia.org/r/270753
[22:52:41] (PS2) Andrew Bogott: Add horizon to labtestweb. [puppet] - https://gerrit.wikimedia.org/r/273985
[22:53:07] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures
[22:53:07] (PS1) Ori.livneh: Send XHGui profiling data to tungsten, not hafnium [mediawiki-config] - https://gerrit.wikimedia.org/r/274001
[22:53:19] Operations, Analytics, ArchCom-RfC, Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2074035 (RobLa-WMF)
[22:54:03] (CR) Ori.livneh: [C: 2] Send XHGui profiling data to tungsten, not hafnium [mediawiki-config] - https://gerrit.wikimedia.org/r/274001 (owner: Ori.livneh)
[22:54:10] (CR) Andrew Bogott: "I forgot that this uses misc-web, and misc-web handles ssl. So this is all way simpler."
[puppet] - https://gerrit.wikimedia.org/r/270753 (owner: Andrew Bogott)
[22:54:30] (Merged) jenkins-bot: Send XHGui profiling data to tungsten, not hafnium [mediawiki-config] - https://gerrit.wikimedia.org/r/274001 (owner: Ori.livneh)
[22:55:04] (PS4) Andrew Bogott: Move horizon apache config into a vhost [puppet] - https://gerrit.wikimedia.org/r/270753
[22:56:10] !log ori@tin Synchronized wmf-config/StartProfiler.php: I6d3ab949a6a: Apply xhgui role on tungsten (duration: 00m 56s)
[22:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:58:08] (PS1) Ori.livneh: Update perf.wm.o to proxy XHGui request to tungsten, not hafnium [puppet] - https://gerrit.wikimedia.org/r/274003
[22:58:10] (CR) Andrew Bogott: [C: 2] Move horizon apache config into a vhost [puppet] - https://gerrit.wikimedia.org/r/270753 (owner: Andrew Bogott)
[22:58:37] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:59:10] (PS2) Ori.livneh: Update perf.wm.o to proxy XHGui request to tungsten, not hafnium [puppet] - https://gerrit.wikimedia.org/r/274003
[22:59:16] (CR) Ori.livneh: [C: 2 V: 2] Update perf.wm.o to proxy XHGui request to tungsten, not hafnium [puppet] - https://gerrit.wikimedia.org/r/274003 (owner: Ori.livneh)
[23:01:19] (PS3) Andrew Bogott: Add horizon to labtestweb. [puppet] - https://gerrit.wikimedia.org/r/273985
[23:03:22] (CR) Andrew Bogott: [C: 2] Add horizon to labtestweb. [puppet] - https://gerrit.wikimedia.org/r/273985 (owner: Andrew Bogott)
[23:05:07] Operations, ops-ulsfo: ulsfo UL Nagios "host DOWN" for pdua-122/pdua-123 - https://phabricator.wikimedia.org/T128383#2074076 (RobH) Reply from UL: > Hi Rob, > > Good afternoon. We ran into a little issue with our monitoring system last > Thursday afternoon causing it to generate false positives. > >...
[23:05:27] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:05:44] !log upgrade elastic2006.codfw.wmnet to elasticsearch 1.7.5
[23:08:18] (PS1) Nschaaf: Change rate for reader segmentation survey [mediawiki-config] - https://gerrit.wikimedia.org/r/274005 (https://phabricator.wikimedia.org/T125946)
[23:11:07] (PS1) Andrew Bogott: Name horizon on labtest 'labtesthorizon.wikimedia.org' [puppet] - https://gerrit.wikimedia.org/r/274006
[23:11:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 59.09% of data above the critical threshold [5000000.0]
[23:12:16] (PS1) Andrew Bogott: move labtesthorizon behind misc-web [dns] - https://gerrit.wikimedia.org/r/274008
[23:12:46] (PS2) Andrew Bogott: Name horizon on labtest 'labtesthorizon.wikimedia.org' behind misc-web [puppet] - https://gerrit.wikimedia.org/r/274006
[23:13:51] (PS2) Andrew Bogott: move labtesthorizon behind misc-web [dns] - https://gerrit.wikimedia.org/r/274008
[23:15:08] (CR) Andrew Bogott: [C: 2] move labtesthorizon behind misc-web [dns] - https://gerrit.wikimedia.org/r/274008 (owner: Andrew Bogott)
[23:16:13] (CR) Andrew Bogott: [C: 2] Name horizon on labtest 'labtesthorizon.wikimedia.org' behind misc-web [puppet] - https://gerrit.wikimedia.org/r/274006 (owner: Andrew Bogott)
[23:19:14] (PS1) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[23:20:48] (CR) jenkins-bot: [V: -1] k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011 (owner: Yuvipanda)
[23:20:53] (PS2) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[23:22:16] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:22:37] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less
than 50.00% above the threshold [1000000.0]
[23:22:52] (PS3) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[23:22:55] (CR) jenkins-bot: [V: -1] k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011 (owner: Yuvipanda)
[23:23:36] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: puppet fail
[23:24:26] (PS4) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[23:25:58] (PS1) Andrew Bogott: Typo fix: s/server/service/ [puppet] - https://gerrit.wikimedia.org/r/274012
[23:26:17] (PS5) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[23:26:35] (CR) Andrew Bogott: [C: 2] Typo fix: s/server/service/ [puppet] - https://gerrit.wikimedia.org/r/274012 (owner: Andrew Bogott)
[23:31:18] RECOVERY - puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[23:33:35] (PS6) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[23:34:58] (CR) jenkins-bot: [V: -1] k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011 (owner: Yuvipanda)
[23:36:18] (PS7) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[23:37:12] (PS1) Andrew Bogott: That's right, folks, it's the second typo in a row on the same line [puppet] - https://gerrit.wikimedia.org/r/274014
[23:38:27] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:38:51] (CR) Andrew Bogott: [C: 2] That's right, folks, it's the second typo in a row on the same line [puppet] - https://gerrit.wikimedia.org/r/274014 (owner:
Andrew Bogott)
[23:39:38] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:41:39] (PS8) Yuvipanda: k8s: Add auth for docker client to authenticate to registry [puppet] - https://gerrit.wikimedia.org/r/274011
[23:43:48] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:44:41] (PS1) Andrew Bogott: Added labtesthorizon config [puppet] - https://gerrit.wikimedia.org/r/274016
[23:44:43] (PS1) Andrew Bogott: Catch liberty up with some horizon apache config changes [puppet] - https://gerrit.wikimedia.org/r/274017
[23:46:10] (CR) Andrew Bogott: [C: 2] Added labtesthorizon config [puppet] - https://gerrit.wikimedia.org/r/274016 (owner: Andrew Bogott)
[23:51:08] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[23:52:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:53:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:53:14] * twentyafterfour didn't get the page ... :-/
[23:54:53] (PS1) Ori.livneh: Remove xhgui role from hafnium [puppet] - https://gerrit.wikimedia.org/r/274021
[23:55:30] (PS2) Ori.livneh: Remove xhgui role from hafnium [puppet] - https://gerrit.wikimedia.org/r/274021
[23:55:38] (CR) Ori.livneh: [C: 2 V: 2] Remove xhgui role from hafnium [puppet] - https://gerrit.wikimedia.org/r/274021 (owner: Ori.livneh)
[23:56:41] !log upgrade elastic2008.codfw.wmnet to elasticsearch 1.7.5
[23:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:59:36] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]