[00:04:40] (CR) BryanDavis: [C: 1] Don't reimplement foreachwiki in l10nupdate-1 [puppet] - https://gerrit.wikimedia.org/r/257268 (owner: Alex Monk)
[00:17:23] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:18:43] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:41:15] (CR) Alex Monk: "Hello?" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634 (owner: Alex Monk)
[01:45:06] Is anybody out there?
[01:47:34] Just nod if you can hear me...
[01:53:07] Is there anyone home?
[01:54:38] * YuviPanda turns into hitler, gets killed by a Time Travelling Jeb Bush
[02:24:27] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 09m 59s)
[02:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:58:47] (PS1) Tim Starling: Remove my non-Yubikey key [puppet] - https://gerrit.wikimedia.org/r/257272
[02:59:24] (CR) Tim Starling: [C: 2] Remove my non-Yubikey key [puppet] - https://gerrit.wikimedia.org/r/257272 (owner: Tim Starling)
[03:01:31] (CR) Tim Starling: [V: 2] Remove my non-Yubikey key [puppet] - https://gerrit.wikimedia.org/r/257272 (owner: Tim Starling)
[03:56:25] (PS1) Yuvipanda: k8s: Make sure docker doesn't do ip-masq [puppet] - https://gerrit.wikimedia.org/r/257273 (https://phabricator.wikimedia.org/T120561)
[03:56:36] I think ^ will fix it
[03:56:38] let's see!
[03:56:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Dec 7 03:56:49 UTC 2015 (duration 1h 32m 22s)
[03:56:50] (PS2) Yuvipanda: k8s: Make sure docker doesn't do ip-masq [puppet] - https://gerrit.wikimedia.org/r/257273 (https://phabricator.wikimedia.org/T120561)
[03:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:57:39] (CR) Yuvipanda: [C: 2 V: 2] k8s: Make sure docker doesn't do ip-masq [puppet] - https://gerrit.wikimedia.org/r/257273 (https://phabricator.wikimedia.org/T120561) (owner: Yuvipanda)
[04:13:19] (PS1) Yuvipanda: [WIP] base: Use the standard location for puppet ca [puppet] - https://gerrit.wikimedia.org/r/257275
[04:13:53] (CR) Yuvipanda: [C: -2] [WIP] base: Use the standard location for puppet ca [puppet] - https://gerrit.wikimedia.org/r/257275 (owner: Yuvipanda)
[04:39:04] operations, Zero: Security: Is it safe to enable Zero spoofing - https://phabricator.wikimedia.org/T120631#1857561 (Yurik) NEW
[04:59:12] Blocked-on-Operations, operations, Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1857578 (ori) https://github.com/facebook/hhvm/commit/75632c113d3ba80104febfbec55fbdcd73f1669f would help with T103886
[05:01:09] (PS2) Ori.livneh: Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/257267 (owner: Alex Monk)
[05:01:20] (CR) Ori.livneh: [C: 2 V: 2] Update my .bashrc [puppet] - https://gerrit.wikimedia.org/r/257267 (owner: Alex Monk)
[05:01:46] (PS2) Ori.livneh: Don't reimplement foreachwiki in l10nupdate-1 [puppet] - https://gerrit.wikimedia.org/r/257268 (owner: Alex Monk)
[05:01:53] (CR) Ori.livneh: [C: 2 V: 2] Don't reimplement foreachwiki in l10nupdate-1 [puppet] - https://gerrit.wikimedia.org/r/257268 (owner: Alex Monk)
[05:02:43] (CR) Ori.livneh: "It's me" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634 (owner:
Alex Monk)
[05:58:15] (CR) Yuvipanda: "Is there anybody out there?" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634 (owner: Alex Monk)
[06:29:43] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:03] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:03] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: puppet fail
[06:30:54] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:13] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail
[06:31:23] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:52] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:02] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:14] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:54] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:54] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:13] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:03] PROBLEM - puppet last run on mw2060 is CRITICAL: CRITICAL: puppet fail
[06:42:13] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail
[06:50:52] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:51:48] <_joe_> and good morning
[06:55:22] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:56:34] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 59 seconds
ago with 0 failures
[06:56:40] (CR) Giuseppe Lavagetto: [C: -1] "In general a checkout would be harmful, as differences in the manifests between production and the change could be caused by a number of r" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/248634 (owner: Alex Monk)
[06:56:54] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:54] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:57:24] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:34] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:58:23] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:14] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:32] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[07:08:32] RECOVERY - puppet last run on mw2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:09:43] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:14:33] RECOVERY - Kafka
Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[07:17:11] (PS3) Muehlenhoff: Migrate the OpenDJ ACL for the "Directory Managers" group [puppet] - https://gerrit.wikimedia.org/r/256909
[07:18:23] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:39:04] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail
[08:06:42] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[08:06:56] operations, Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#1857653 (jcrespo) Creating one account per user not only it is possible, but desirable. There are 2 main problems: * Administration overhead, in an already short staff * L...
[08:15:24] (PS3) Giuseppe Lavagetto: etcd: remove package etcdctl [puppet] - https://gerrit.wikimedia.org/r/255088 (https://phabricator.wikimedia.org/T118830)
[08:15:26] (PS2) Giuseppe Lavagetto: etcd: add client configuration facility [puppet] - https://gerrit.wikimedia.org/r/255998
[08:15:28] (PS9) Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972)
[08:16:23] (CR) Giuseppe Lavagetto: [C: 2] "This won't have any practical consequence at the moment." [puppet] - https://gerrit.wikimedia.org/r/255088 (https://phabricator.wikimedia.org/T118830) (owner: Giuseppe Lavagetto)
[08:17:34] anyone has idea about stuck Jenkins due to rake-jessie job?
[08:19:39] (CR) Giuseppe Lavagetto: [V: 2] etcd: remove package etcdctl [puppet] - https://gerrit.wikimedia.org/r/255088 (https://phabricator.wikimedia.org/T118830) (owner: Giuseppe Lavagetto)
[08:27:51] <_joe_> !log uploaded etcd 2.2 package from stretch to jessie-wikimedia
[08:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:31:49] operations, Salt: salt still has issues with grain selection? - https://phabricator.wikimedia.org/T114937#1857682 (ArielGlenn) Open>Resolved a:ArielGlenn I'm going to close this as expected behavior given the timeout, and answered.
[08:33:17] (CR) Muehlenhoff: [C: 2 V: 2] Migrate the OpenDJ ACL for the "Directory Managers" group [puppet] - https://gerrit.wikimedia.org/r/256909 (owner: Muehlenhoff)
[08:33:24] (PS4) Muehlenhoff: Migrate the OpenDJ ACL for the "Directory Managers" group [puppet] - https://gerrit.wikimedia.org/r/256909
[08:33:33] (CR) Muehlenhoff: [V: 2] Migrate the OpenDJ ACL for the "Directory Managers" group [puppet] - https://gerrit.wikimedia.org/r/256909 (owner: Muehlenhoff)
[08:34:53] Blocked-on-Operations, operations, Parsoid, Salt, Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1857687 (ArielGlenn) This is indeed a failure between the client and the master; this behavio...
[08:36:06] (PS3) Giuseppe Lavagetto: etcd: add client configuration facility [puppet] - https://gerrit.wikimedia.org/r/255998
[08:36:58] operations, Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1857690 (ArielGlenn)
[08:37:00] Blocked-on-Operations, operations, Parsoid, Salt, Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1857689 (ArielGlenn)
[08:41:19] (PS4) Giuseppe Lavagetto: etcd: add client configuration facility [puppet] - https://gerrit.wikimedia.org/r/255998
[08:44:02] (CR) Giuseppe Lavagetto: [C: 2] etcd: add client configuration facility [puppet] - https://gerrit.wikimedia.org/r/255998 (owner: Giuseppe Lavagetto)
[08:44:19] (CR) Giuseppe Lavagetto: [V: 2] etcd: add client configuration facility [puppet] - https://gerrit.wikimedia.org/r/255998 (owner: Giuseppe Lavagetto)
[08:46:21] (PS6) Jcrespo: New variable binlog_format and reconfiguring es1013 [puppet] - https://gerrit.wikimedia.org/r/256657
[08:51:15] (CR) Jcrespo: [C: 2] New variable binlog_format and reconfiguring es1013 [puppet] - https://gerrit.wikimedia.org/r/256657 (owner: Jcrespo)
[08:51:32] (CR) Jcrespo: [V: 2] New variable binlog_format and reconfiguring es1013 [puppet] - https://gerrit.wikimedia.org/r/256657 (owner: Jcrespo)
[08:54:25] still reports of thumbs not being purged
[08:57:27] operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1857705 (TheDJ) More reports on WP:VP/T keep coming in.
[08:57:42] <_joe_> thedj: sigh
[09:13:31] !log es1013 maintenance (mysql restart, upgrade, possible reboot)
[09:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:15:04] operations, Traffic, Patch-For-Review, Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1857707 (Joe) Actually I don't think *this* was caused by the apache filter - we were seeing this on varnish as well; while it was rather t...
[09:16:56] (PS3) Giuseppe Lavagetto: pybal: install monitoring [puppet] - https://gerrit.wikimedia.org/r/252245 (https://phabricator.wikimedia.org/T102394)
[09:18:13] (CR) Giuseppe Lavagetto: [C: 2 V: 2] pybal: install monitoring [puppet] - https://gerrit.wikimedia.org/r/252245 (https://phabricator.wikimedia.org/T102394) (owner: Giuseppe Lavagetto)
[09:27:30] !log nodetool decommission restbase1008
[09:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:27:37] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1857743 (fgiunchedi) finshed dec 05 10:41 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK:...
[09:28:21] (PS1) Giuseppe Lavagetto: pybal::monitoring: install libnagios-plugin-perl [puppet] - https://gerrit.wikimedia.org/r/257280
[09:29:29] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/257280 (owner: Giuseppe Lavagetto)
[09:36:18] (PS1) Giuseppe Lavagetto: pybal: fix diamond collector [puppet] - https://gerrit.wikimedia.org/r/257281
[09:36:25] there are like 100 deletes/s on commons since 5AM, usually deletes are very low
[09:37:09] (CR) Giuseppe Lavagetto: [C: 2 V: 2] pybal: fix diamond collector [puppet] - https://gerrit.wikimedia.org/r/257281 (owner: Giuseppe Lavagetto)
[09:37:49] (PS1) Filippo Giunchedi: torrus: move icinga check to https [puppet] - https://gerrit.wikimedia.org/r/257282 (https://phabricator.wikimedia.org/T119582)
[09:38:08] I think there is a lot of maintenance doing deletes and inserts
[09:38:19] jynus, 5AM in which timezone? :)
[09:38:32] I am an operator
[09:38:43] for me there is no timezones, only UTC
[09:39:00] :-)
[09:39:58] yes, I think it fits https://grafana.wikimedia.org/dashboard/db/job-queue-rate
[09:40:24] !log CI / Zuul stalled. Nodepool can no more spawn instances :-/
[09:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:41:26] jynus: mass upload perhaps?
I'm seeing a spike in PUT from 6 to 7.30, https://grafana.wikimedia.org/dashboard/db/swift?panelId=7&fullscreen&from=1449464590422&to=1449477186043&var-DC=eqiad
[09:41:40] that would explain it, yes
[09:42:16] usually, after uploads there are dozens of deferred jobs
[09:43:50] I am not concerned but either job queue should be slower or we should give more resources to commons, probably the second
[09:44:37] (I am speaking aloud, ignore me)
[09:45:02] ah yeah we've probably didn't notice before as the job queue was much slower
[09:45:45] (PS2) Filippo Giunchedi: torrus: move icinga check to https [puppet] - https://gerrit.wikimedia.org/r/257282 (https://phabricator.wikimedia.org/T119582)
[09:45:52] (CR) Filippo Giunchedi: [C: 2 V: 2] torrus: move icinga check to https [puppet] - https://gerrit.wikimedia.org/r/257282 (https://phabricator.wikimedia.org/T119582) (owner: Filippo Giunchedi)
[09:46:10] !log restarting Nodepool on labnodepool1001.eqiad.wment
[09:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:46:25] the poor nodepool has troubles reaching the wmflabs openstack API apparently :(
[10:01:42] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[10:03:03] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d
[10:04:49] (PS1) Jcrespo: Repool es1013 (lower weight for now) and depool es1017 [mediawiki-config] - https://gerrit.wikimedia.org/r/257289
[10:05:31] (CR) Jcrespo: [C: 2] Repool es1013 (lower weight for now) and depool es1017 [mediawiki-config] - https://gerrit.wikimedia.org/r/257289 (owner: Jcrespo)
[10:05:55] !log stopped Nodepool.
Can not create instances anymore on wmflabs ( https://phabricator.wikimedia.org/T120586 )
[10:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:07:35] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1013 (lower weight for now) and depool es1017 (duration: 00m 41s)
[10:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:14:35] (CR) Filippo Giunchedi: [C: 1] zookeeper: move roles to module/role [puppet] - https://gerrit.wikimedia.org/r/257035 (owner: Dzahn)
[10:14:59] (CR) Filippo Giunchedi: [C: 1] elasticsearch: move role to module/role [puppet] - https://gerrit.wikimedia.org/r/257036 (owner: Dzahn)
[10:20:43] !log restarted nova-conductor and scheduler on labcontrol1001
[10:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:24:23] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:26:13] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[10:27:47] (PS1) Jcrespo: Rolling restart and upgrade for es1017 [puppet] - https://gerrit.wikimedia.org/r/257292
[10:28:19] (PS2) Jcrespo: Rolling restart and upgrade for es1017 [puppet] - https://gerrit.wikimedia.org/r/257292
[10:33:42] (CR) Filippo Giunchedi: [C: 1] "LGTM, some notes:" [debs/bloomd] - https://gerrit.wikimedia.org/r/257167 (https://phabricator.wikimedia.org/T120544) (owner: Ori.livneh)
[10:33:57] (CR) Jcrespo: [C: 2] Rolling restart and upgrade for es1017 [puppet] - https://gerrit.wikimedia.org/r/257292 (owner: Jcrespo)
[10:37:03] operations, Datasets-General-or-Unknown, netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1857862 (Nemo_bis) Thanks, I tried quickly crafting such a wget line but failed; now
trying that one (but it seems that's only slightly faster, a...
[10:43:15] !log CI / zuul / nodepool recovered. Root cause was some malfunction in openstack wmflabs
[10:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:52:41] !log database and system maintenance to es1017
[10:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:53:34] PROBLEM - Auth DNS for labs pdns on labs-ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[10:56:12] PROBLEM - Auth DNS for labs pdns on labs-ns3.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[11:05:25] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 0.031 second response time
[11:06:51] oooh
[11:07:07] godog: ^ authdns failed! I wonder if that means pdns itself has failed
[11:08:23] godog: I'm going to try restarting pdns itself
[11:08:30] YuviPanda: ok, on which host?
[11:08:41] godog: holmium
[11:08:53] !log restarting pdns on holmium
[11:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:09:08] zilch effect
[11:09:13] haha, morebots still works?! wtf
[11:09:14] operations, Datasets-General-or-Unknown, netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1857932 (ArielGlenn) for comparison: wget http://dumps.wikimedia.org/itwiki/20151202/itwiki-20151202-pages-articles.xml.bz2 for me fluctuate...
[11:09:26] (CR) Addshore: [C: 1] Set $wgExtDistGraphiteRenderApi [mediawiki-config] - https://gerrit.wikimedia.org/r/257077 (https://phabricator.wikimedia.org/T120339) (owner: Legoktm)
[11:09:28] also seems to have no logs...
[11:09:28] YuviPanda: do you know where pdns is supposed to log?
[11:09:31] haha
[11:09:34] godog: I see some logs in syslog
[11:09:37] godog: not very useful
[11:09:45] syslog, but not much atm
[11:09:59] yeah
[11:10:15] resolution is back for me on a labs instance btw
[11:10:44] godog: what are you resolving?
[11:10:48] google.com
[11:11:12] godog: dig @208.80.154.20 tools-webgrid-lighttpd-1203.tools.eqiad.wmflabs
[11:11:15] still fails
[11:11:21] with SERVFAIL
[11:12:06] indeedly
[11:12:52] so SERVFAIL is the server itself failing...
[11:12:57] and not a connectivity issue, right?
[11:13:23] yeah that's an answer from the server
[11:13:37] right
[11:13:56] I'm trying to increase logging on holmium
[11:15:30] Dec 7 11:15:11 holmium pdns[13266]: Done launching threads, ready to distribute questions
[11:16:13] (PS1) ArielGlenn: lower max reconnect time for salt minion [puppet] - https://gerrit.wikimedia.org/r/257295
[11:17:21] godog: I might page andrewbogott
[11:17:40] YuviPanda: yeah that seems like a good idea at this point, we're not going very far
[11:17:40] (CR) ArielGlenn: [C: 2] lower max reconnect time for salt minion [puppet] - https://gerrit.wikimedia.org/r/257295 (owner: ArielGlenn)
[11:18:03] godog: yeah
[11:18:11] RECOVERY - Auth DNS for labs pdns on labs-ns3.wikimedia.org is OK: DNS OK: 5.053 seconds response time. nagiostest.eqiad.wmflabs returns
[11:18:15] oh
[11:18:17] wat
[11:18:33] O_o
[11:18:38] no
[11:18:42] everything is still failing
[11:18:49] also wtf is labs-ns
[11:18:51] 3
[11:19:02] holmium
[11:19:16] oh yeah
[11:19:18] that's back
[11:19:20] dig @208.80.155.118 tools-webgrid-lighttpd-1203.tools.eqiad.wmflabs
[11:19:20] RECOVERY - Auth DNS for labs pdns on labs-ns2.wikimedia.org is OK: DNS OK: 0.043 seconds response time. nagiostest.eqiad.wmflabs returns
[11:19:44] k
[11:19:48] so...
[11:19:56] the second restart didn't fix it but the *fourth* did?!
[11:19:58] wat
[11:20:01] YuviPanda: tools seems up from here again (from the outside)
[11:20:08] things are back to normal yeah
[11:20:11] ....
[11:20:31] godog: does picking up the phone to call andrewbogott trigger this?
[11:20:46] YuviPanda: hahaha perhaps! still no idea what happened
[11:20:55] i think it was the restart of pdns-recursor after the restart of pdns which fixed it
[11:20:59] yeah
[11:21:04] that might have been it
[11:21:05] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 951191 bytes in 3.610 second response time
[11:21:20] I first restarted pdns-recursor (no effect), then pdns (no effect), then pdns-recursor
[11:21:33] still, "loglevel 9" makes no noticable difference?
[11:21:38] moritzm: nope
[11:22:47] hmm, according to https://doc.powerdns.com/md/common/logging/ 9 should log everything
[11:23:50] hmm, so my theory was that the lua file was causing problems (I have commented that out), but not the case
[11:23:52] the file looks fine
[11:24:16] moritzm: yeah, I set loglevel=9 on holmium, syslog has nothing
[11:25:04] (Abandoned) JanZerebecki: webperf::navtiming additionaly store wikidata seperately [puppet] - https://gerrit.wikimedia.org/r/249573 (https://phabricator.wikimedia.org/T116568) (owner: JanZerebecki)
[11:28:24] still confused about labservices1001 and holmium having the same addresses bound to eth0, but I guess question for andrewbogott to answer
[11:30:55] godog: heh
[11:31:43] operations, Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1857952 (ArielGlenn) I was seeing some really troubling behavior, a nonresponsive minion when test.ping to the one minion that made me think my changes had suddenly gone awry. But today it a...
[11:37:34] godog: moritzm i sent out an email
[11:37:55] (CR) Filippo Giunchedi: "LGTM overall, a few comments" (9 comments) [debs/bloomd] - https://gerrit.wikimedia.org/r/257168 (owner: Ori.livneh)
[11:38:10] godog: moritzm I'm going to peel an orange and monitor the situation while peeling, and I'll go to sleep if everything is still up by the end of the orange
[11:38:42] (CR) Filippo Giunchedi: [C: -1] "and forgot to -1" [debs/bloomd] - https://gerrit.wikimedia.org/r/257168 (owner: Ori.livneh)
[11:39:16] YuviPanda: hahah ok!
[11:39:17] operations, Datasets-General-or-Unknown, netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1857963 (Vituzzu) Here's mine (yep, from windows): 1 1 ms 1 ms 1 ms 192.168.0.1 2 2 ms 1 ms 1 ms 192.168.1.1 3...
[11:39:51] godog: also checkout tools.wmflabs.org/paws (use OAuth with an account that does not have special chars in it, but needs fixing). Our first 'kubernetes native' application
[11:40:18] YuviPanda: enjoy the orange and good night :-)
[11:40:23] godog: gives everyone a shell with pywikibot setup
[11:40:30] (in their own docker container)
[11:45:51] PROBLEM - salt-minion processes on mw1196 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:47:32] YuviPanda: hahaa that's awesome! I've authenticated and waiting for startup
[11:49:21] godog: you have used your wmf account
[11:49:26] godog: won't work since it has a special char in it
[11:49:38] godog: it's a bug that is fixed in latest master that I haven't deployed yet
[11:49:47] godog: try an account without special chars, should show up immediately
[11:50:16] my orange is finished now, so I'll go once godog tries this out with a non WMF account :D
[11:50:21] <_joe_> kill -9 yuvi
[11:50:26] YuviPanda: ah ok, yeah I don't think I have one on meta :) go to sleep!
[11:51:35] _joe_: if I had been killed earlier, someone else would've had to deal with the labs DNS outage!
[11:55:45] operations, Traffic, Patch-For-Review, Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1857986 (mark) @Joe: easy to test, no? :)
[12:02:36] (PS1) Jcrespo: Depool es1015, es1013 at 100% load, es1017 pooled with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/257297
[12:06:06] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1062.eqiad.wmnet because of too many down!: aqs_7232 - Could not depool server aqs1002.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1011.eqiad.wmnet because of too many down!: citoid_1970 - Could not depool server sca1001.eqiad.wmnet because of too many down!: mathoid_10042 - Coul
[12:11:46] RECOVERY - salt-minion processes on mw1196 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:18:11] (CR) Jcrespo: [C: 2] Depool es1015, es1013 at 100% load, es1017 pooled with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/257297 (owner: Jcrespo)
[12:20:33] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1015; es1013 at 100% load; pool es1017 with low weight (duration: 00m 28s)
[12:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:23:36] ACKNOWLEDGEMENT - torrus.wikimedia.org HTTP on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 8354 bytes in 0.083 second response time Filippo Giunchedi https://phabricator.wikimedia.org/T119582
[12:24:11] operations, HTTPS, Patch-For-Review: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1858024 (fgiunchedi) I've moved the check to https but icinga still doesn't
like it, acked there but I don't have the bandwidth to investigate @dzahn
[12:36:39] PROBLEM - PyBal backends health check on lvs1011 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
[12:36:39] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
[12:38:40] PROBLEM - PyBal backends health check on lvs1008 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
[12:38:48] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - mathoid_10042 - Could not depool server sca1002.eqiad.wmnet because of too many down!
[12:38:48] PROBLEM - PyBal backends health check on lvs3003 is CRITICAL: PYBAL CRITICAL - mobilelb6_443 - Could not depool server cp3015.esams.wmnet because of too many down!
[12:39:00] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
[12:39:14] what is going on?
[12:39:35] smells like a faulty check?
[12:41:32] hrm
[12:42:31] pybal has been complaining about rcs100x for days
[12:42:39] indeed
[12:43:04] the check is new
[12:43:14] _joe_ introduced it today I think?
[12:43:16] then it's already proving useful
[12:44:16] <_joe_> Yes, a few hours agi
[12:45:03] <_joe_> But there is already a ticket about rcstream on 443
[12:46:10] <_joe_> But The ocg failure seems legit
[12:46:20] the rcs seems legit as well
[12:46:24] what is the ticket you're talking about?
[12:46:57] <_joe_> Heh, one sec, I'm currently at lunch
[12:51:47] OCG seems online to me
[12:52:07] so maybe it's not completely broken
[12:53:04] (PS1) Jcrespo: Appling rolling changes to es1015 [puppet] - https://gerrit.wikimedia.org/r/257302
[12:53:34] the lvs3001/3 are more interesting..
[12:54:32] (CR) Jcrespo: [C: 2] Appling rolling changes to es1015 [puppet] - https://gerrit.wikimedia.org/r/257302 (owner: Jcrespo)
[12:56:51] !log rolling restart, configuration upgrade of es1015
[12:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:58:05] operations, Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1858091 (ArielGlenn) Well I have just seen this behavior again and I don't know if it's specific to my changes to the minion code or not, so I'm investigating it. Slow going because the mult...
[12:58:29] <_joe_> paravoid: https://gerrit.wikimedia.org/r/#/c/253917/
[12:59:18] ??
[12:59:20] I fixed that a long time ago
[12:59:48] ACKNOWLEDGEMENT - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1062.eqiad.wmnet because of too many down!: aqs_7232 - Could not depool server aqs1002.eqiad.wmnet because of too many down!: search_9200 - Could not depool server elastic1011.eqiad.wmnet because of too many down!: citoid_1970 - Could not depool server sca1001.eqiad.wmnet because of too many down!: mathoid_1004
[12:59:50] paravoid@serenity:~$ openssl s_client -connect stream.wikimedia.org:443
[12:59:53] CONNECTED(00000003)
[12:59:56] depth=0 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc.", CN = stream.wikimedia.org
[12:59:59] etc.
[13:00:44] <_joe_> paravoid: uhm I didn't check, I just remember brandon asking about it
[13:01:54] operations, Traffic, Wikimedia-Stream, Patch-For-Review: rcstream service on port 443 is broken, spamming logs - https://phabricator.wikimedia.org/T118956#1858101 (faidon) I'm not sure what is this about… the "cleartext over 443" bug was T102313, fixed with 0c57a368230711b5534dc4855155533ce7ed9d5d...
[13:02:18] <_joe_> If you're looking at lvs3003 I can look at mathoid
[13:02:51] PROBLEM - MariaDB Slave IO: es2 on es1015 is CRITICAL: CRITICAL slave_io_state could not connect
[13:03:03] oh, I forgot to ack that
[13:03:04] (CR) Faidon Liambotis: [C: -2] "See bug."
[puppet] - 10https://gerrit.wikimedia.org/r/253917 (https://phabricator.wikimedia.org/T118956) (owner: 10BBlack) [13:03:31] PROBLEM - MariaDB Slave SQL: es2 on es1015 is CRITICAL: CRITICAL slave_sql_state could not connect [13:05:03] ACKNOWLEDGEMENT - PyBal backends health check on lvs1008 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: ocg_8000 - Could not depool server ocg1001.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down! Giuseppe Lavagetto Not in service [13:05:16] everything is scheduled, I've just forgotten to do that 1 out of 150 times [13:05:23] ACKNOWLEDGEMENT - PyBal backends health check on lvs1011 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down! Giuseppe Lavagetto Not in service [13:05:45] why is it not in service? [13:05:53] it is in service afaik [13:06:48] _joe_: ^ [13:06:53] <_joe_> lvs1007-12 are not active since there was the network issue, right? [13:07:01] oh, the lvs servers you meant [13:07:08] <_joe_> yup [13:07:12] yeah sure, but the pybal checks are legitimate [13:07:16] so [13:07:19] this is interesting: [13:07:20] root@lvs3003:~# curl http://localhost:9090/alerts; echo [13:07:20] mobilelb6_443 - Could not depool server cp3015.esams.wmnet because of too many down! [13:07:23] root@lvs3003:~# journalctl -u pybal.service | grep cp3015 | tail -3 [13:07:26] Dec 06 05:36:41 lvs3003 pybal[2336]: [mobilelb6_443] ERROR: Monitoring instance ProxyFetch reports server cp3015.esams.wmnet (enabled/up/pooled) down: 503 Service Unavailable [13:07:28] <_joe_> they are? 
I think it's a reachability problem [13:07:30] Dec 06 05:36:50 lvs3003 pybal[2336]: [mobilelb_443] INFO: Server cp3015.esams.wmnet (enabled/partially up/not pooled) is up [13:07:33] Dec 06 05:36:51 lvs3003 pybal[2336]: [mobilelb6_443] INFO: Server cp3015.esams.wmnet (enabled/partially up/not pooled) is up [13:07:45] <_joe_> uhm [13:08:21] <_joe_> paravoid: yes, the message is not accurate, but it should mean that we're still not above the depooling threshold? [13:08:24] <_joe_> I'll check [13:09:21] <_joe_> nope, this is a bug clearly. uhm [13:11:03] <_joe_> this looks like a bug in the code that should clear the alerts [13:11:14] alright :) [13:12:33] <_joe_> oh ffs it's clear enough [13:12:43] <_joe_> sigh [13:13:24] <_joe_> when removing one single server sends you below the depool threshold, the logic doesn't work [13:14:10] <_joe_> brb [13:16:14] (PS1) Faidon Liambotis: lvs: fix ProxyFetch URLs for rcstream [puppet] - https://gerrit.wikimedia.org/r/257308 [13:17:03] (CR) Faidon Liambotis: [C: 2] lvs: fix ProxyFetch URLs for rcstream [puppet] - https://gerrit.wikimedia.org/r/257308 (owner: Faidon Liambotis) [13:17:56] (CR) Faidon Liambotis: [V: 2] lvs: fix ProxyFetch URLs for rcstream [puppet] - https://gerrit.wikimedia.org/r/257308 (owner: Faidon Liambotis) [13:18:00] stupid jenkins [13:19:47] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy [13:19:55] operations-puppet-puppetlint-strict SUCCESS in 1m 34s <--- [13:19:59] should make that one faster [13:23:17] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy [13:23:18] ^^^ rcstream issue fixed [13:24:04] fixed?
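A rough Python sketch of the depool-threshold check that _joe_ describes at 13:13:24, to show how a "Could not depool server ... because of too many down!" alert can be raised while the server stays pooled. This is not PyBal's actual code: the function names can_depool and handle_down_server, the 0.5 threshold and the server names cp-a and cp-b are all assumptions made for illustration.

```python
def can_depool(pooled, total, threshold):
    """True if removing one server still leaves at least threshold * total pooled."""
    return (len(pooled) - 1) >= total * threshold


def handle_down_server(server, pooled, total, threshold, alerts):
    """Depool a failed server if the threshold allows it, otherwise record an alert.

    Hypothetical helper: 'pooled' and 'alerts' are plain sets of server names.
    """
    if server not in pooled:
        return
    if can_depool(pooled, total, threshold):
        pooled.discard(server)
        alerts.discard(server)
    else:
        # Too few servers would remain: keep it pooled and flag it instead.
        alerts.add(server)


# Two-server pool with a 50% threshold (illustrative values only).
pooled = {'cp-a', 'cp-b'}
alerts = set()
handle_down_server('cp-a', pooled, total=2, threshold=0.5, alerts=alerts)
handle_down_server('cp-b', pooled, total=2, threshold=0.5, alerts=alerts)
```

In this sketch the second server is never depooled, so anything that clears its entry in alerts has to run on recovery without a matching repool event. Going by the discussion, that clearing path appears to be where the bug was.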
[13:24:53] I think mariadb may page again even if its downtimed the UP [13:24:54] I thought (aside from any hostname issue) that we still had the issue that the rcstream servers were configured for HTTP over port 443 [13:24:59] bblack: I think you're talking about a different problem, one that was fixed back in June :P [13:25:09] bblack: see https://phabricator.wikimedia.org/T118956#1858101 [13:26:02] bblack@palladium:~$ curl -sv http://rcs1001.eqiad.wmnet:443/ [13:26:07] < HTTP/1.1 400 Bad Request [13:26:07] < Server: nginx/1.4.6 (Ubuntu) [13:26:15] yes, so? [13:26:28] http-over-443? [13:26:40] see the body of that bad request [13:26:51] oh ok [13:26:56] that's odd, is that normal for apache? [13:27:09] I would've thought just total protocol failure, not an HTTP response [13:27:26] yes it is [13:27:30] for both apache + nginx [13:27:42] (03Abandoned) 10BBlack: disable stream.wm.o:443, broken for a long time [puppet] - 10https://gerrit.wikimedia.org/r/253917 (https://phabricator.wikimedia.org/T118956) (owner: 10BBlack) [13:27:43] fwiw same thing happens with e.g. 
en.wp.org:443 [13:28:03] heh guess so [13:29:47] (03PS1) 10Giuseppe Lavagetto: Fix race condition in clearing the alerts [debs/pybal] - 10https://gerrit.wikimedia.org/r/257310 [13:30:33] <_joe_> paravoid: ^^ this should fix it [13:33:32] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: persist journal logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/257208 (owner: 10BBlack) [13:35:15] (03CR) 10Mark Bergsma: Fix race condition in clearing the alerts (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/257310 (owner: 10Giuseppe Lavagetto) [13:37:29] (03CR) 10Giuseppe Lavagetto: Fix race condition in clearing the alerts (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/257310 (owner: 10Giuseppe Lavagetto) [13:39:22] (03PS3) 10BBlack: pybal: persist journal logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/257208 [13:39:33] (03CR) 10BBlack: [C: 032] pybal: persist journal logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/257208 (owner: 10BBlack) [13:39:37] (03CR) 10Mark Bergsma: Fix race condition in clearing the alerts (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/257310 (owner: 10Giuseppe Lavagetto) [13:39:41] (03CR) 10BBlack: [V: 032] pybal: persist journal logs to disk [puppet] - 10https://gerrit.wikimedia.org/r/257208 (owner: 10BBlack) [13:41:01] opsens, anyone knows who is managing gerrit? I cannot submit a patch, e.g. https://gerrit.wikimedia.org/r/#/c/257300/https://gerrit.wikimedia.org/r/#/c/257300/ [13:43:20] Is it zuul/jenkins gated? [13:44:55] yurik: that's a 116K-line patch, maybe it's taking a bit to process it first? [13:45:52] (03PS2) 10Giuseppe Lavagetto: Fix race condition in clearing the alerts [debs/pybal] - 10https://gerrit.wikimedia.org/r/257310 [13:47:36] lol @ that commit [13:56:15] bblack, paravoid - there "submit" button is gone for both of these repos - i only have the "publish comment" button. 
[13:56:55] <_joe_> that's probably gerrit protecting itself [13:58:04] _joe_, its never been a problem before - deploy repos always use large commits, usually much bigger than this one [13:58:43] there is a permission setting somewhere that controls if the user has an ability to submit or just to comment [13:59:05] some gerrit repos had that, but not the deploy repos. And now this has changed [13:59:27] https://gerrit.wikimedia.org/r/#/admin/projects/maps/kartotherian/deploy,access [14:03:22] Reedy, do you know which right i need to add back? And if someone has changed it, maybe this is not the right place to fix it [14:03:52] also, is there a history log for access rights changes? [14:05:16] I don't think anything has changed https://git.wikimedia.org/log/maps%2Fkartotherian%2Fdeploy.git/refs%2Fmeta%2Fconfig [14:05:40] config branch is exactly the same as it used to be according to that [14:06:00] unless replication gerrit -> gitblit is broken, which it doesn't seem like it is [14:07:49] (03PS1) 10Giuseppe Lavagetto: pybal: fix collection of individual stats via diamond [puppet] - 10https://gerrit.wikimedia.org/r/257315 [14:08:27] (03PS2) 10Giuseppe Lavagetto: pybal: fix collection of individual stats via diamond [puppet] - 10https://gerrit.wikimedia.org/r/257315 [14:10:15] akosiaris, something must have changed - the submit button disappeared. Do you know who is maintaining gerrit? [14:10:27] No one is [14:10:39] Gerrit is rotten :D [14:11:21] agreed )) actually, something did change by Krinkle in september - https://git.wikimedia.org/log/All-Projects.git/refs%2Fmeta%2Fconfig [14:11:25] i cannot view it for some reason [14:11:55] gitblit sucks [14:12:11] Reedy, didn't you just say that about gerrit? [14:12:15] )) [14:12:21] gitblit is even worse than gerrit [14:12:38] heh [14:12:54] i say we stop fighting the windmills and switch to github. [14:13:08] or Phabricator? 
[14:13:09] we would instantly gain all devs of the world [14:13:23] phabricator is great, but it still has one major problem - no users [14:13:34] Facebook uses it [14:13:37] github's ease of use is by far superior [14:13:45] facebook does not solicit volunteer help [14:13:54] And then we're having a closed source, 3rd party host all our code as the canonical endpoint? [14:14:20] Reedy, i don't really like it, but the alternative is that we have a tiny percentage of contributions we could have had [14:14:30] GitHub is good for drive-by contributions. I'm still waiting for a comparison whether the GitHub workflow helps attract long-term contributors more than an org-specific Git installation [14:14:40] and no, i didn't mean to host our passwords on github ) [14:14:43] <_joe_> Reedy: we could try gogs.io [14:14:57] We don't host our passwords in gerrit [14:15:01] <_joe_> Reedy: I've installed it @home [14:15:11] (PS1) JanZerebecki: Wikidata: set maxSerializedEntitySize to 2500 [mediawiki-config] - https://gerrit.wikimedia.org/r/257318 [14:15:14] andre__, the majority of wikipedia's content is also driveby )) [14:15:36] <_joe_> it's very nice and easy, I have no idea about how it would scale to a larger thing. [14:15:50] O-O ... _joe_ is hosting our passwords at home?? [14:16:17] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "I won't wait for jenkins any longer." [puppet] - https://gerrit.wikimedia.org/r/257315 (owner: Giuseppe Lavagetto) [14:17:14] yurik: true, good point actually. Thanks!
[14:18:36] I realized that one of my jobs is stuck in the gate-and-submit pipeline at https://integration.wikimedia.org/zuul/ if it's in the way for more important things don't hesitate to kill it [14:18:55] * yurik switches from `git review` to `git push` [14:20:07] ooo, sooo much nicer )) [14:20:19] operations, Traffic, Wikimedia-Stream, Patch-For-Review: rcstream service on port 443 is broken, spamming logs - https://phabricator.wikimedia.org/T118956#1858233 (faidon) Open>Invalid a:faidon [14:21:15] Because fuck code review [14:21:17] yurik: Gerrit is maintained by Release Engineering [14:21:29] maintained is a strong word :) [14:21:38] it is [14:22:04] surely ops do more for gerrit? [14:22:10] restarting it when it gives up etc [14:22:11] :D [14:22:18] basically yeah [14:22:23] ostriches does maintain it [14:22:24] Reedy, reviewing 112k line change when we update to the newer lib version is ... hard in gerrit ) [14:22:32] qchris still follows up on a volunteer basis [14:22:34] git review -d 123456 [14:22:37] ops also have the option to rm -rf and rebuild ;) [14:22:53] https://phabricator.wikimedia.org/T70271 :) [14:22:55] and we have a bunch of admins doing the repos creations / access right tweak etc [14:23:05] * yurik hates gerrit as the most ridiculous system ever ... [14:23:13] open for 18 months now [14:23:59] * yurik is trying to think of anything worse... maybe \r\n or \ vs / or UTF16 vs UTF8 .... [14:24:23] (CR) Addshore: "Is the biggest revision not bigger than this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/257318 (owner: JanZerebecki) [14:24:24] yurik: yeah it's a ridiculous system for uploading 116K-line unreviewable patches [14:24:40] it's not a bad code review system in general, though [14:24:59] string terminated by null is way worse :-D [14:25:22] hashar: speaking of things maintained by releng...
https://gerrit.wikimedia.org/r/#/c/257315/ has been waiting for jenkins for almost 20 minutes now [14:25:30] agreed, but that's what we have for service deployment at the moment. As for code review - no, i think it's horrible because it's non-linear - we amend previous commits, ruining the history [14:25:46] hashar, yeah, you are right, gets() was probably the worst invention [14:26:54] zomg, gerrit upgrade [14:28:52] (CR) Addshore: "[wikidatawiki]> select MAX(page_len) from page;" [mediawiki-config] - https://gerrit.wikimedia.org/r/257318 (owner: JanZerebecki) [14:29:27] yurik: but keeping the old patchsets in the main history would be quite useless. would make bisecting so much more difficult. [14:30:30] !log CI / Zuul stalled somehow [14:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:30:37] paravoid: yeah poor Zuul is blocked looking [14:30:42] ok :) [14:30:43] maybe it doesn't hurt for small niche github projects… but we merge 20 patches a day [14:30:44] thx [14:30:57] (and that's counting mediawiki/core only) [14:31:02] MatmaRex, first, we can always squash on merge - that's actually a git option there. But more importantly, we cannot do complex development -- if i do a patch, i cannot develop something on top of it until the first is merged, or else i cannot rebase easily [14:31:25] yurik: of course you can. i do that all the time. [14:31:31] me too [14:32:00] (you can also have proper feature branches in gerrit, although no one here seems to use that) [14:32:23] !log Jenkins lost a bunch of executors :/ [14:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:15] MatmaRex, it "kinda" works. But way too many times git got confused if the same code got changed in both patches, and failed on rebase, forcing me to manually remerge it every time i did a rebase [14:33:47] (CR) Ottomata: [C: 1] "Should be transparent change since class names are the same, right?"
[puppet] - 10https://gerrit.wikimedia.org/r/257035 (owner: 10Dzahn) [14:33:54] as the result, i gave up on chained patches, and simply do a bigger one if the merge is stalling [14:34:09] yurik: well, yes, that's how it works. but i don't see how github would help here [14:34:18] i guess you'd submit more patches on top instead of amending older ones? [14:34:58] MatmaRex, github helps because if you did the hard work of rebasing once, you don't have to do it again -- it stays in the history, and you are rebasing only on top of the newer changes, not from the start [14:35:04] (03CR) 10Addshore: [C: 031] "Well, it is hard to determine which is correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257318 (owner: 10JanZerebecki) [14:35:17] (03PS1) 10coren: Labs: make labs-ip-alias-dump ignore addressless hosts [puppet] - 10https://gerrit.wikimedia.org/r/257323 (https://phabricator.wikimedia.org/T120586) [14:35:23] (not from the fork point) [14:36:01] i don't understand [14:36:29] (03PS1) 10Giuseppe Lavagetto: Add a warning endpoint to catch misconfigurations [debs/pybal] - 10https://gerrit.wikimedia.org/r/257324 [14:37:10] Anyone here using irccloud? [14:37:20] Git is hard. I miss svn. [14:37:30] ? [14:37:32] haha [14:37:37] andrewbogott_: I sold my soul yes. [14:37:45] ostriches: and it’s working for you right now? [14:37:48] I do not miss creating branches by copying all files [14:37:58] ostriches: btw, I saw on that task above that you're working on gerrit 2.12 -- I didn't know this was the plan, so.. kudos :) [14:38:02] andrewbogott_: I'm using it as we speak :) [14:38:14] huh [14:38:17] I can connect directly to freenode but not to the irccloud server [14:38:24] maybe it uses a different port which is blocked... [14:38:59] !log restarting Jenkins [14:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:08] jynus: see you're using svn all wrong. Branches don't work so you just shouldn't use them! [14:39:19] ah! 
[14:39:41] :-) [14:40:45] (03CR) 10Faidon Liambotis: "I can't say that the idea of automatically removing a trusted point of origin (puppet's TOFU) thrills me. Especially since this is somethi" [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [14:42:17] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:42:42] <_joe_> !log restarted pybal on lvs1006 [14:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:02] (03CR) 10JanZerebecki: "Just to document our discussion, I remember something about the size there being an estimate, that is why I downloaded the revision, thoug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257318 (owner: 10JanZerebecki) [14:44:06] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [14:44:26] 7Blocked-on-Operations, 6operations, 10Traffic: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1858261 (10fgiunchedi) p:5Triage>3Normal [14:45:30] 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1858264 (10fgiunchedi) [14:46:36] 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (10fgiunchedi) I've removed the block on operations since this seems blocked on regressions now [14:46:36] RECOVERY - PyBal backends health check on lvs3003 is OK: PYBAL OK - All pools are healthy [14:46:41] <_joe_> !log also restarted pybal on lvs3003 [14:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:15] (03CR) 10Andrew Bogott: [C: 031] "This is, at worst, harmless :)" [puppet] - 10https://gerrit.wikimedia.org/r/257323 (https://phabricator.wikimedia.org/T120586) 
(owner: 10coren) [14:48:47] 6operations, 10ops-eqiad: db1019 failing disk (degraded RAID) - https://phabricator.wikimedia.org/T120511#1858268 (10fgiunchedi) p:5Triage>3Normal [14:49:43] 6operations, 10Reading-Web, 7Varnish: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1858270 (10fgiunchedi) p:5Triage>3Low [14:49:48] (03CR) 10coren: [C: 032] "This is harmless in the general case, and prevents it from breaking in edge cases." [puppet] - 10https://gerrit.wikimedia.org/r/257323 (https://phabricator.wikimedia.org/T120586) (owner: 10coren) [14:50:06] (03CR) 10coren: [V: 032] "This is harmless in the general case, and prevents it from breaking in edge cases." [puppet] - 10https://gerrit.wikimedia.org/r/257323 (https://phabricator.wikimedia.org/T120586) (owner: 10coren) [14:51:07] 6operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#1858274 (10fgiunchedi) p:5Triage>3Normal [14:52:02] deploying tilerator... 
waiting for sync [14:52:05] 6operations, 6Zero: Security: Is it safe to enable Zero spoofing - https://phabricator.wikimedia.org/T120631#1858276 (10fgiunchedi) p:5Triage>3Normal [14:52:05] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [14:52:43] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1858280 (10fgiunchedi) p:5Triage>3Normal [14:56:17] !log deployed latest tilerator [14:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:45] heya akosiaris, a further rsyslog q for you if you are around [14:58:21] ottomata: I am [14:58:49] akosiaris: ha, actually,i may have answered my own question, but it will be good for you to verify [14:58:56] there will be multiple instances of an eventlogging program [14:59:05] so i'm trying to figure out what $programname actually is [14:59:19] but....look slike it is the ga...which is TagName at the start of the actual log line? [15:00:03] tag* [15:00:04] yurik: Dear anthropoid, the time has come. Please deploy Deploy Kartotherian & Tilerator service (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151207T1500). [15:00:55] (03CR) 10Faidon Liambotis: [C: 04-1] Add a new security module with ::pam and ::access (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [15:01:07] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1858288 (10Joe) So testing this with a single apache in a pool, I can see that AcceptFilter http none doesn't change the behaviour of Idlecon... 
[15:01:47] ottomata: not sure I follow [15:02:05] so [15:02:17] each eventlogging daemon instance will need to have its own log file [15:02:34] (PS1) Muehlenhoff: Imported Upstream version 1.0.2e [debs/openssl] - https://gerrit.wikimedia.org/r/257329 [15:02:35] e.g. a kafka -> mysql consumer, or a raw kafka -> parsed kafka processor, etc. [15:02:36] or [15:02:36] (PS1) Muehlenhoff: Build 1.0.2e for jessie-wikimedia [debs/openssl] - https://gerrit.wikimedia.org/r/257330 [15:02:38] (PS1) Muehlenhoff: Cherrypick three changes from 1.0.2e-1 fixing the build [debs/openssl] - https://gerrit.wikimedia.org/r/257331 [15:02:40] makes sense [15:02:42] http -> kafka service [15:02:51] if we do this via rsyslog [15:02:56] i need to set $programname somehow [15:02:56] right? [15:02:57] so i can do [15:03:06] like [15:03:06] if $programname == 'varnishkafka' then /var/log/varnishkafka.log [15:03:14] but with whatever daemon instance [15:03:29] just trying to find the right name to match the daemon instance name to a log file in rsyslog conf [15:03:39] it's looking like I have to prepend each log message with a name? [15:03:43] not sure yet though [15:04:09] (PS11) coren: Add a new security module with ::pam and ::access [puppet] - https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [15:04:21] hmm, that one I do not know, I'll have to research it a bit [15:04:25] (CR) coren: "All fixed." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: coren) [15:04:51] ottomata: will every instance of eventlogging have its own config file?
[15:05:02] yes [15:05:04] well [15:05:07] i mean, doesn't have to [15:05:12] but if I have to do it this way, i guess so [15:05:16] (CR) Faidon Liambotis: [C: 1] Add a new security module with ::pam and ::access [puppet] - https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: coren) [15:05:52] ja could configure a python logging formatter to output the log line that way [15:05:56] if that is what you are getting at [15:06:25] ottomata: it wasn't, I was thinking maybe have eventlogging set the process name and perhaps that's what syslog matches [15:06:51] <_joe_> akosiaris: so we have a baffling problem, apparently [15:06:56] _joe_: ? [15:07:06] I am not baffled [15:07:10] ;-) [15:07:17] <_joe_> we moved our base::service unit defs to /lib/systemd/system... [15:07:28] <_joe_> but when we reinstall a package [15:07:30] operations, Datasets-General-or-Unknown, netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1858292 (faidon) Traceroutes *to* dumps provide some insight but are generally not very useful, since you're downloading files and not uploading.... [15:07:32] ah yes to allow mask to work [15:07:42] Nemo_bis: ^ [15:07:46] <_joe_> apt will happily overwrite what we put there via puppet [15:07:59] ah... hmmm [15:08:16] <_joe_> which almost made me faint when upgrading etcd in k8s this morning [15:08:35] <_joe_> so the correct solution would be to make sure every deb declares its systemd units as conffiles [15:08:38] <_joe_> I think [15:08:50] that would solve the problem indeed [15:08:55] <_joe_> but we don't want to rebuild each and every one of them [15:08:56] kind of a lot of work though [15:09:14] <_joe_> are you baffled now?
[15:09:17] <_joe_> :P [15:09:21] nope [15:10:30] operations, Datasets-General-or-Unknown, netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1858296 (Nemo_bis) Sorry i did not mean to make a recommendation, just to show a workaround that anyone can use with few cents. [15:10:44] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "LGTM" [debs/pybal] - https://gerrit.wikimedia.org/r/256942 (owner: Mark Bergsma) [15:11:56] Blocked-on-Operations, operations, Traffic: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1846476 (Tnegrin) Hi Adam -- we should review this "soon" as it potentially impacts users who are trying to increase page performance by not downloading images but have... [15:13:10] apergos, around? i'm trying to sync kartotherian, and it's stuck 1/4 [15:13:23] even though tilerator just went through to the same servers [15:13:25] yurik: I'm here [15:13:48] are you still in the deploy at the 'retry, report' etc prompt? [15:13:58] apergos, yep [15:14:14] ok lemme poke around a minute [15:14:15] 2001 is done, 200{2,3,4} are stalling [15:14:20] all three eh [15:14:21] hrm [15:14:42] * ostriches uses a trebuchet to hurl some insults at Trebuchet. [15:14:56] * yurik is not happy with trebuchet at all ;) [15:15:23] ostriches: throw insults and see what sticks/hits? ;) [15:15:25] * yurik renames trebuchet to russian roulette [15:15:43] Reedy, 0/10000000 fetched [15:16:31] apergos, should i [n] or [retry] it?
[15:16:45] not yet [15:16:57] btw, on each "enter", it takes some time to show the 1/4 [15:20:31] (03PS1) 10Jcrespo: Depool es1019; es1017 at 100% load; Repool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257333 [15:20:57] yurik: yeah actually now I want you to retry [15:21:12] ottomata: I think I 've figured it out [15:21:15] yeah it's going to take some time to show up [15:21:17] f = logging.Formatter('TagName: %(message)s') [15:21:28] apergos, retrying [15:21:57] ottomata: so, seems like you are correct [15:22:29] * apergos is watching it happen [15:22:34] is TagName actually the tag? [15:22:38] or is it a placeholder I need to figrue out [15:22:42] i'm getting closer too that too [15:22:43] to* [15:23:11] (03CR) 10Jcrespo: [C: 032] Depool es1019; es1017 at 100% load; Repool es1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257333 (owner: 10Jcrespo) [15:24:13] looks like they all did the checkout this time [15:24:51] er, the fetch [15:25:11] so far i've tried both TagName: eventlogging-service-eventbus and just eventlogging-service-eventbus: [15:25:23] with if $programname == 'eventloggin-service-eventbus' [15:25:28] no bites from rsyslog yet... :) [15:25:35] it just goes to regular syslog so far [15:25:40] so yurik you should go ahead to the checkout phase now if you haven't already [15:25:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1019; es1017 at 100% load; pool es1015 with low weight (duration: 00m 28s) [15:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:27] apergos, works! 
[15:26:31] continuing [15:26:57] watching it some more [15:27:29] (03CR) 10Muehlenhoff: [C: 032 V: 032] Imported Upstream version 1.0.2e [debs/openssl] - 10https://gerrit.wikimedia.org/r/257329 (owner: 10Muehlenhoff) [15:27:39] akosiaris: its ok, i'll figure it out eventually, just wondering if you already knew [15:27:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Build 1.0.2e for jessie-wikimedia [debs/openssl] - 10https://gerrit.wikimedia.org/r/257330 (owner: 10Muehlenhoff) [15:27:56] doo dee doo dee doo [15:28:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Cherrypick three changes from 1.0.2e-1 fixing the build [debs/openssl] - 10https://gerrit.wikimedia.org/r/257331 (owner: 10Muehlenhoff) [15:28:12] (03PS1) 10Giuseppe Lavagetto: pybal: do not overwrite stats for each server [puppet] - 10https://gerrit.wikimedia.org/r/257335 [15:28:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] pybal: do not overwrite stats for each server [puppet] - 10https://gerrit.wikimedia.org/r/257335 (owner: 10Giuseppe Lavagetto) [15:29:11] yurik: looks like they all completed [15:29:15] so now if you like I can tell you what broke :-D [15:29:31] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix race condition in clearing the alerts [debs/pybal] - 10https://gerrit.wikimedia.org/r/257310 (owner: 10Giuseppe Lavagetto) [15:30:01] (03PS3) 10Giuseppe Lavagetto: Fix race condition in clearing the alerts [debs/pybal] - 10https://gerrit.wikimedia.org/r/257310 [15:31:31] (03PS1) 10Muehlenhoff: Add a .gitreview file [debs/openssl] - 10https://gerrit.wikimedia.org/r/257336 [15:32:22] apergos, yep, finished! thx. Is it something i did? :) [15:32:29] not exactly [15:33:15] so for whatever reason the salt minions were restarted (a pupppet change? something pushed out?) on all the maps-tests at once. 
just slightly before you started your deploy [15:33:15] they take a little bit of time to do setup and such of the connection [15:33:26] which clearly had not completed for most of them by the time you ran your deploy [15:33:32] hell of a race condition [15:33:41] naturally on the retry they all ran just fine :-/ [15:34:26] trebuchet should be able to check that the job is still running on the minions and if not, decide that the job has not and will not run and prompt you to retry immediately [15:34:34] I mean after its timeout [15:34:55] so in that sense it is a trebuchet bug that this check is not made [15:36:39] apergos, funny, akosiaris just restarted them probably because i saw that my access rights were not enabled [15:37:17] thanks for looking into it! [15:37:20] all's good now [15:37:25] ok! [15:37:26] !log deployed kartotherian [15:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:49] !log uploaded openssl 1.0.2e-1~wmf1 to carbon [15:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:40:47] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add a .gitreview file [debs/openssl] - 10https://gerrit.wikimedia.org/r/257336 (owner: 10Muehlenhoff) [15:43:07] _joe_, just a reminder, would appreciate review of this [15:43:08] https://gerrit.wikimedia.org/r/#/c/253465/ [15:43:19] i'm still working on logging, but the general placement of things is mainly what I'd like review of [15:43:24] <_joe_> ottomata: I know, trying to get there [15:43:27] <_joe_> :/ [15:45:06] np danke! [15:46:25] ok, akosiaris i think I got it. [15:46:26] but [15:46:37] the logs still also go to regular syslog [15:46:41] which doesn't seem ideal [15:55:43] YES got it! 
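A minimal Python sketch of the approach ottomata and akosiaris converge on above: prefix every log line with a tag, so that an rsyslog rule on $programname can route each daemon instance to its own file. The Formatter pattern and the tag eventlogging-service-eventbus come from the conversation. The helper build_tagged_handler, the logger name and the in-memory stream are assumptions added for illustration. A real setup would presumably attach the same formatter to logging.handlers.SysLogHandler, and the exact rsyslog rule that finally worked is not shown in the log.

```python
import io
import logging


def build_tagged_handler(tag, stream):
    """Return a handler whose lines start with 'tag: ' (hypothetical helper)."""
    handler = logging.StreamHandler(stream)
    # The conversation suggests rsyslog takes the leading 'name:' of the
    # message as the syslog tag, which $programname is then matched against.
    handler.setFormatter(logging.Formatter(tag + ': %(message)s'))
    return handler


# In-memory stream so the sketch runs without a syslog daemon (assumption);
# for real use, build the handler around logging.handlers.SysLogHandler.
buffer = io.StringIO()
logger = logging.getLogger('eventbus-sketch')
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(build_tagged_handler('eventlogging-service-eventbus', buffer))

logger.info('consumer started')
tagged_line = buffer.getvalue()
```

Giving each daemon instance its own tag keeps the routing decision in the rsyslog configuration, at the cost of needing one formatter (or config entry) per instance, as discussed at 15:04:51.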
[15:55:52] 6operations, 10RESTBase-Cassandra: track/alert cassandra certs expiration - https://phabricator.wikimedia.org/T120662#1858341 (10fgiunchedi) 3NEW a:3fgiunchedi [15:56:16] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1858356 (10fgiunchedi) [15:56:18] 6operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1858355 (10fgiunchedi) [15:56:20] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1858351 (10fgiunchedi) 5Open>3Resolved ca/certs expiration tracked in {T120662}, resolving [15:58:29] ottomata: :-) [15:58:30] (03PS1) 10Alexandros Kosiaris: monitoring: Move all groups into hiera [puppet] - 10https://gerrit.wikimedia.org/r/257344 [15:59:48] _joe_: what do you think of this ? https://gerrit.wikimedia.org/r/#/c/257344/ [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151207T1600). Please do the needful. [16:00:04] tto Dereckson bearND, mdholloway kart_ bblack jzerebecki: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:01:06] o/ [16:02:55] I can SWAT. OK, so we have more than 8 patches. I'm going to bump the last 2 (since we have 10) unless SWAT moves quickly. [16:03:55] here [16:03:57] jzerebecki: Any chance you could do mine first? It's 3am here and I'd like to go back to bed :) [16:04:27] sorry, thcipriani: ^ [16:04:33] tto: ping for SWAT. Also, noticed that 157338 says "Beta Cluster only for now." but it doesn't use the realm-specific files. [16:04:34] <_joe_> akosiaris: love the idea, I think we're repeating shit around a _lot_ [16:04:35] see what I mean about 3am... 
[16:04:43] (03PS2) 10Alexandros Kosiaris: monitoring: Move all groups into hiera [puppet] - 10https://gerrit.wikimedia.org/r/257344 [16:04:59] _joe_: ok, I 'll run it through the compiler and merge then [16:05:12] I'm here. Mine's pretty trivial, it's a config revert to a state from a few weeks ago, nothing can go wrong with it [16:05:13] thcipriani: That's true, the intention is to use the config for production if it works on labs [16:05:27] Hi. [16:05:37] tto: ok, so you just want this deployed on beta? [16:05:55] thcipriani: the list looks long, if you run outta time I can push mine to tomorrow as well [16:06:04] thcipriani: I can't see why not. [16:06:05] bblack: thanks :) [16:07:22] (03PS3) 10Alexandros Kosiaris: monitoring: Move all groups into hiera [puppet] - 10https://gerrit.wikimedia.org/r/257344 [16:07:35] if you don't get to mine, I'm fine with it being pushed back [16:08:11] tto: that's fine, we can do that, but merging it to master to get it out to beta-only is not typically what we'd do. Ideally we could cherry-pick to the deployment-bastion and check its behavior. SWAT is mainly for getting it deployed to prod (unless it's something realm-specific that needs merged). [16:08:14] (03CR) 10Alexandros Kosiaris: [C: 032] monitoring: Move all groups into hiera [puppet] - 10https://gerrit.wikimedia.org/r/257344 (owner: 10Alexandros Kosiaris) [16:08:17] jzerebecki: kk, thanks! [16:08:33] thcipriani: I scheduled it for SWAT because no-one was merging it otherwise. [16:08:41] It should be a no-op for production. [16:08:48] (03PS1) 10Giuseppe Lavagetto: New package version [debs/pybal] - 10https://gerrit.wikimedia.org/r/257347 [16:09:07] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] New package version [debs/pybal] - 10https://gerrit.wikimedia.org/r/257347 (owner: 10Giuseppe Lavagetto) [16:09:19] If you would prefer, I can move it to the labs-only config file? 
[16:10:13] tto: that'd be awesome, makes me a little nervous otherwise :) [16:10:44] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255954 (https://phabricator.wikimedia.org/T118067) (owner: 10TTO) [16:11:12] thcipriani: If you'll give me a couple minutes... :) [16:12:04] tto: np, I've got plenty more patches to SWAT in the interim. Getting out your other one now: https://gerrit.wikimedia.org/r/#/c/255954/ [16:12:07] (03Merged) 10jenkins-bot: Translate project namesapce for jbowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255954 (https://phabricator.wikimedia.org/T118067) (owner: 10TTO) [16:14:26] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Translate project namesapce for jbowiki [[gerrit:255954]] (duration: 00m 28s) [16:14:28] ^ tto check please [16:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:09] (03PS13) 10TTO: Enable cluster-wide import setup on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) [16:15:15] thcipriani, will do [16:15:23] New patch incoming for the second one as well [16:15:36] tto: ty! [16:16:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254665 (https://phabricator.wikimedia.org/T119308) (owner: 10Dereckson) [16:17:08] It works: https://jbo.wikipedia.org/wiki/uikipedi%27as:bende_ckupau https://jbo.wikipedia.org/wiki/casnu_la_.uikipedi%27as.:bende_ckupau [16:17:08] (03Merged) 10jenkins-bot: Namespace configuration for ur.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254665 (https://phabricator.wikimedia.org/T119308) (owner: 10Dereckson) [16:17:14] thcipriani: I removed the patch about the lad.wikipedia logo, it's not ready yet [16:17:33] Dereckson: ack, thanks. [16:17:45] tto: awesome. Thanks for checking. 
[16:18:03] <_joe_> !log uploaded pybal 1.13.2 to reprepro [16:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:41] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration for ur.wikipedia [[gerrit:254665]] (duration: 00m 30s) [16:19:44] ^ Dereckson check please [16:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:05] Works [16:21:17] Dereckson: awesome. tto: first patch looks scoped-to-labs now. Ready for it to merge? [16:21:40] Sure, so long as it look OK to you, after all, it is the middle of the night here and I am not very awake [16:22:09] _joe_: i'll see if I can test my MED changes on the test host as well btw [16:22:17] i'll setup a quagga instance if there isn't one already [16:22:32] doubt there is one :) [16:22:41] I didn't set up one and I don't know of anyone else that would know quagga :P [16:23:10] alright ;) [16:23:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [16:23:26] perhaps I should have said "unless someone beats me to it" ;p [16:24:01] (03Merged) 10jenkins-bot: Enable cluster-wide import setup on the Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [16:24:06] <_joe_> paravoid: indeed :P [16:24:35] <_joe_> mark: yeah I left that and the adding of the warning to the next "release", as they look like bigger changes [16:24:39] tto: so this should get deployed on beta via https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/ [16:24:42] <_joe_> this should be able to go out easily [16:24:49] there was a time that I'd know that if I hadn't done it, it wouldn't be done [16:24:53] made life simpler in some ways! 
[16:25:23] _joe_: yeah it's still fairly trivial BUT should at least be tested against a BGP speaker before deployment ;) [16:25:36] thciprian: OK, looks to take some time [16:25:37] i haven't run that code at all [16:25:46] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254486 (https://phabricator.wikimedia.org/T118847) (owner: 10Dereckson) [16:26:23] (03Merged) 10jenkins-bot: Rights configuration on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254486 (https://phabricator.wikimedia.org/T118847) (owner: 10Dereckson) [16:27:12] 6operations, 10Traffic: Improve Varnish XFF processing for trusted proxies - https://phabricator.wikimedia.org/T120121#1858529 (10BBlack) [16:27:45] 6operations, 10Traffic: Improve Varnish XFF processing for trusted proxies - https://phabricator.wikimedia.org/T120121#1845623 (10BBlack) @Yurik - added info about Testproxy to the end of the description. [16:28:08] RECOVERY - PyBal backends health check on lvs1011 is OK: PYBAL OK - All pools are healthy [16:28:20] <_joe_> !log upgrading pybal on lvs1007-12 [16:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Rights configuration on fa.wikipedia [[gerrit:254486]] (duration: 00m 29s) [16:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:28:41] ^ Dereckson check please [16:28:46] Testing. [16:29:02] Works. [16:29:17] thanks [16:29:48] RECOVERY - PyBal backends health check on lvs1008 is OK: PYBAL OK - All pools are healthy [16:30:02] Their botadmin group has a lot of add/del rights by the way. That's strange. They use them for bureaucrat work? 
[16:30:15] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254476 (https://phabricator.wikimedia.org/T119207) (owner: 10Dereckson) [16:30:34] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1858539 (10BBlack) The revert was pushed off for tomorrow, because SWAT today was too full of post-freeze changes... [16:31:29] Well, I'll leave a comment closing the bug about that, so they'll be able to examine the matter in their local community. [16:31:45] <_joe_> !log upgrading pybal on all the backup LVSs in codfw [16:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:49] hmm, zuul didn't pick up https://gerrit.wikimedia.org/r/#/c/254476/ [16:34:15] moritzm: Do you have any time to spare before the meeting? [16:34:41] (03CR) 10Thcipriani: Namespace configuration for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254476 (https://phabricator.wikimedia.org/T119207) (owner: 10Dereckson) [16:34:46] thcipriani: yeah there is some issue going on with Zuul / CI / Jenkins :-( [16:35:01] (03CR) 10Thcipriani: [C: 032] "SWAT, again" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254476 (https://phabricator.wikimedia.org/T119207) (owner: 10Dereckson) [16:35:04] thcipriani: but then for this patch [16:35:10] thcipriani: it is pending on parent change https://gerrit.wikimedia.org/r/#/c/254475/1 [16:35:32] there is a GErrit dependency [16:35:46] hashar: oh, right, missed that. [16:35:57] yeah that is very confusing [16:36:14] Zuul reports nothing because it knows the patch is not going to land due to the dep [16:36:23] as soon as you +2 the dep, it should enqueue both [16:36:40] thcipriani: I add 254475 to the SWAT deploy list so? [16:36:46] Dereckson: is https://gerrit.wikimedia.org/r/#/c/254475/1 acutally needed? 
[16:36:56] <_joe_> !log installing the new pybal version on the backup lvss in esams and ulsfo [16:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:06] (03PS1) 10Jcrespo: Rolling configuration and restart of es1019 [puppet] - 10https://gerrit.wikimedia.org/r/257360 [16:37:21] Yes, they want this namespace with subpages. [16:37:23] if so, sure, go ahead and add it. Seems fairly innocuous. [16:38:09] Added to the list. [16:38:19] kk [16:38:22] (03CR) 10Jcrespo: [C: 032] Rolling configuration and restart of es1019 [puppet] - 10https://gerrit.wikimedia.org/r/257360 (owner: 10Jcrespo) [16:38:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254475 (owner: 10Dereckson) [16:38:59] now they should both roll into master :) [16:39:18] My tool to see what patches are to deploy look Gerrit and https://phabricator.wikimedia.org/project/sprint/board/178/query/open/ so it doesn't print a Gerrit patch not linked in the column To deploy. [16:39:26] (03Merged) 10jenkins-bot: Enable subpages on custom aliases from 112 to 119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254475 (owner: 10Dereckson) [16:39:36] thcipriani: I can't wait indefinitely for the beta cluster zuul queue to complete, how long would it normally take? [16:39:53] (03Merged) 10jenkins-bot: Namespace configuration for en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254476 (https://phabricator.wikimedia.org/T119207) (owner: 10Dereckson) [16:40:43] hmm... no sooner do I speak than it all suddenly happens at once [16:41:02] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1858577 (10coren) I'm attempting to reproduce the issue without any shelf attached as a first attempt at isolating where the actual issue lies. If that happens with... 
[16:41:11] tto: not terribly long, it looks like the config-deploy on beta already has the code there, then next update should be beta-scap-eqiad to push everything out. [16:41:26] (also, seems like there were some zuul things stuck) [16:42:36] thcipriani: We can put my patch for tomorrow. [16:42:39] not urgent. [16:42:46] thcipriani: should I do that? [16:42:58] kart_: if that's fine, please do :) [16:43:03] !log restarting mysql, upgrading and rebooting mysql on es1019 [16:43:03] and thank you. [16:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:37] <_joe_> !log installing pybal on eqiad backups [16:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:51] thcipriani: okay! [16:44:34] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration for en.wiktionary [[gerrit:254476]] and Enable subpages on custom aliases from 112 to 119 [[gerrit:254475]] (duration: 00m 27s) [16:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:40] ^ Dereckson check please [16:45:18] thcipriani: done. [16:45:20] subpages ok [16:45:22] :) [16:45:26] Reconstruction ok too. Works. [16:45:35] Dereckson: awesome! thanks. [16:45:41] kart_: appreciated. [16:45:54] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1858617 (10Papaul) [16:46:01] bearND: or mdholloway ping for SWAT [16:46:08] Thanks for the deploy. [16:46:32] thcipriani: hi [16:47:02] 7Blocked-on-Operations, 6operations, 10Traffic: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1858632 (10Yurik) @dr0ptp4kt, do you know if we still need the ZeroOpts=tls? https://github.com/wikimedia/mediawiki-extensions-ZeroBanner/blob/master/includes/PageRender... 
[16:47:42] (03PS2) 10Eevans: setup EventBus extension for deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256418 (https://phabricator.wikimedia.org/T116786) [16:47:48] mdholloway: howdy, do you have a backport to .7 for the MobileApp change? [16:49:02] thcipriani: Works! [16:49:18] tto: awesome! glad to hear it rolled out ok. [16:49:49] Thanks for the SWATs! I'm going to go back to sleep now. [16:51:43] thcipriani: sorry, i'm pretty ignorant of how this is supposed to work. :) i'm reviewing an email thread between bearND and greg-g about what is supposed to happen here. i don't *think* we need anything backported, just deployed. [16:52:13] Good night tto. [16:52:28] mdholloway: np, just need to cherry pick this change to the branch that's deployed out currently which is the .7 branch, I'll make a quick cherry pick and have you double check. [16:52:45] thcipriani: cool, thanks! [16:52:59] 6operations, 10Traffic: Improve Varnish XFF processing for trusted proxies - https://phabricator.wikimedia.org/T120121#1858651 (10BBlack) [16:53:01] mdholloway: https://gerrit.wikimedia.org/r/#/c/257365/ [16:53:18] ^ can you check that for me, make sure it looks like what you need deployed. [16:53:33] thcipriani: looks good to me! [16:53:40] awesome. [16:53:46] 6operations, 10Datasets-General-or-Unknown, 10netops: dumps.wikimedia.org seems to have poor networking towards Telia - https://phabricator.wikimedia.org/T120425#1858664 (10Nemo_bis) == Comparison eqiad/esams from nike.fixme.fi to upload.wikimedia.org == ``` $ for c in eqiad esams; do for p in 4 6; do wget... 
[16:53:51] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1858665 (10faidon) a:3RobH [16:54:37] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1858670 (10RobH) [16:55:44] 6operations, 10Traffic: Improve Varnish XFF processing for trusted proxies - https://phabricator.wikimedia.org/T120121#1858676 (10BBlack) [16:56:47] <_joe_> !log upgrading pybal on primaries in esams, ulsfo, codfw [16:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:24] mdholloway: just waiting on jenkins to merge the code to the wmf .7 branch and then we can get it out the door. You just need this file deployed, correct? Nothing else needs to happen (e.g., database changes, maintenance scripts, etc.) [16:58:45] thcipriani: nope, i think that's it! thanks for your help. [16:59:04] mdholloway: np :) [17:00:03] !log Restarted Jenkins to start the ZeroMQ publisher https://phabricator.wikimedia.org/T120668 [17:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:56] thcipriani: thank you [17:02:35] !log reset uncommited change to includes/jobqueue/JobRunner.php in /srv/mediawiki-staging/php-1.27.0-wmf.7 [17:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:46] 7Blocked-on-Operations, 6Analytics-Backlog, 10Analytics-EventLogging: Drop tables MobileWebClickTracking_* from eventlogging db - https://phabricator.wikimedia.org/T120674#1858723 (10Nuria) 3NEW a:3jcrespo [17:04:46] (03PS3) 10Alex Monk: Checkout and then rebase instead of cherry-pick [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/248634 [17:04:53] !log thcipriani@tin Synchronized php-1.27.0-wmf.7/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android Beta app: 10% [[gerrit:254045]] (duration: 00m 29s) [17:04:55] ^ mdholloway check 
please [17:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:37] thcipriani: ok. cc/bearND [17:08:00] (03PS1) 10Yuvipanda: Revert "Labs: make labs-ip-alias-dump ignore addressless hosts" [puppet] - 10https://gerrit.wikimedia.org/r/257368 [17:08:20] chasemp: Coren andrewbogott ^ [17:08:23] also see my email to ops@ [17:08:58] YuviPanda: I disagree with you about that - instances can have no addresses in normal operation, and the script shouldn't explode. [17:09:19] Coren: failing explicitly causes no issues, and the only time it did fail we had issues :) [17:09:22] so... [17:09:27] silent failure is a big no-no [17:09:30] let's make a task to sort this out? [17:09:32] ok [17:09:39] I mean I agree I was goig to throw a syslog line in there at teh least [17:10:01] I agree that silent failure is a no-no - except that an instance not having addresses yet is not a failure condition. [17:10:02] chasemp: I think a puppet failure notice is better I think, since it's louder :) [17:10:14] Coren: the only time it's happened it's been a failure condition. [17:10:21] (03CR) 10DCausse: Set initial titlesuggest shard sizes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256974 (owner: 10EBernhardson) [17:10:24] all of this is also a red herring and unrelated to the real reason for the outage [17:10:27] which I've no idea about [17:10:29] that's a whole bag of world YuviPanda :) but in general idk care about puppet notify, atm I disagree with puppet running it [17:10:37] thcipriani: mdholloway: we expect to see the changes on https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json [17:10:39] because puppet disabled can not be an error condition [17:10:45] which means a non-error state breaks new instances [17:10:49] ? 
[17:10:52] 6operations, 7Mail: New Onboard's (Marc Brent) Email not working - https://phabricator.wikimedia.org/T120672#1858748 (10faidon) p:5Unbreak!>3High [17:10:58] chasemp: hmm? [17:11:06] chasemp: this doesn't break new instances... [17:11:15] better sorted on teh task :) this is all slightly tangential to the issue [17:11:29] chasemp: there are *two* outages and two issues :D one was the instance creation fail and then the DNS one [17:11:37] agreed [17:11:46] so this puppet failure was a result of the former, and unrelated (IMO?) to the later [17:11:51] or if related, not a direct cause of [17:11:57] YuviPanda: Yep, I'm pretty sure that's right. [17:12:05] that I'm honestly just not sure of on the second [17:12:11] what happens if that ip-alias script never runs? [17:12:16] or fails to happen even [17:12:22] chasemp: the old script is used? [17:12:48] I have no idea but it seems the big question on whether it breaking caused an outage [17:12:49] 6operations, 10Analytics-Cluster, 10Traffic: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1858788 (10BBlack) Due to scheduling constraints and the new-ness of the TLS-capable release of kafka, we're going to rely on IPSec for this for now and revisit TLS... [17:13:16] bearND: thcipriani: hmm, yeah. a quick test in the app isn't registering the change yet, either. [17:13:17] chasemp: That's almost certainly not the case. [17:13:23] bearND: hmm, just spot checked a few servers, the change has definitely made it out for config/config.json for MobileApp [17:13:24] chasemp: if you read the email there are a bunch of safeguards in place, me, andrewbogott and Krenair debated this a fair bit when we merged this. 
[17:13:28] but yeah, let's file a task [17:13:44] sure and I pointed some of them out pre-your email actually :) [17:13:46] in the same thread [17:13:47] in the meantime, I'm getting somewhat angsty about the lack of failure notification there [17:13:54] * andrewbogott is in a meeting, believe it or not [17:14:01] chasemp: yeah, I don't think I managed to read it fully and hence just woke up... [17:14:09] but that doesn't answer the question of what that exec does and what breaks if it doesn't run [17:14:12] since it wasn't running [17:14:15] that seems relevant [17:14:19] chasemp: The reason I fixed the script is so that I could reenable pdns calling it so that aliasing occurs again - but the script didn't break that it was Yuvi commenting it out. :-) [17:14:48] chasemp: what it does break is 'new instances with public IPs do not get their public DNS names resolved inside labs' is what breaks [17:14:49] not sure what that means [17:14:52] which is totally ok to break. [17:15:08] well, not really? but that's another side topic [17:15:14] no? exactly that? [17:15:18] Yeah, task time. [17:15:21] ok... [17:15:22] thcipriani: on meta.wikimedia.org? [17:15:35] Coren: mind if I merge my revert right now? having no notification on failure makes me feel nervous [17:15:35] why would dns for public ip's be ok to break? [17:15:40] chasemp: no... [17:15:47] chasemp: it's the *translation* that breaks. [17:16:01] chasemp: aka if you hit something.wmflabs.org from inside labs it usually doesn't work until this script runs [17:16:15] chasemp: at which point it just gives you the internal IP. it's a split horizon thing. [17:16:28] sure but we offer it and people depend on it it seems better to not to have it or to say it's not ok to break [17:16:42] sure, more reason for it to break really loudly? [17:16:44] :) [17:16:46] YuviPanda: I think that revert is ill-advised. 
Instances not having IP addresses is not an error condition; openstack being unable to assign them is but that's not that script's fault. [17:17:13] Coren: I... disagree. The only time it happened is when there *was* an error condition... [17:17:35] that's the crux of my question, why did this not break before if it's normal [17:17:57] chasemp: Because in normal operation is very unlikely to happen. It would have done the same thing eventually. [17:18:15] ^ and in a working scheduler things don't not have an ip for long [17:18:19] bearND: meta is on the .7 version, so it _should_ be there. Does meta route to a specific server I can check? [17:18:21] it's usually less than a few seconds [17:18:27] ok fair, that's a point of contentino then, and if that is true then we should //at least log// the occurrance [17:18:45] chasemp: I've no objection to logging it [17:18:58] I think because 1. it is rare, 2. if it happens it is a reflection of this or something else being really broken... [17:19:01] chasemp: It makes sense we'd want to note it - but not that it breaks the script. [17:19:19] 6operations, 10hardware-requests: determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1858826 (10RobH) 3NEW a:3RobH [17:19:27] meh. whatever. [17:19:53] thcipriani: mdholloway: no idea about meta [17:22:17] (03CR) 10Yuvipanda: "Yes. I can make it explicitly fail if that flag is set and it isn't realm labs..." [puppet] - 10https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [17:23:51] <_joe_> YuviPanda: can you wait for my review before you merge ^^ [17:24:05] _joe_: sure. 
[17:25:40] 6operations, 6Labs, 10Labs-Infrastructure: logrotate/disk space on silver for nutcracker log - https://phabricator.wikimedia.org/T120683#1858873 (10Dzahn) 3NEW [17:28:43] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1858892 (10RobH) Last week we processed the disk upgrades, and I need to followup and get RAM/CPU options. This wasn't a high priority compared to the other orders, so it was shi... [17:31:55] 6operations, 7Mail: New Onboard's (Marc Brent) Email not working - https://phabricator.wikimedia.org/T120672#1858912 (10akosiaris) So, we got an LDAP server in production that mirrors `ldap1.corp.wikimedia.org`. That is to avoid very expensive SMTP checks or high latency LDAP checks to OIT's LDAP servers . An... [17:36:28] 6operations, 7Mail: New Onboard's (Marc Brent) Email not working - https://phabricator.wikimedia.org/T120672#1858922 (10JGulingan) I can go create a ticket for Joel or Byron to go and take a look at it. Thanks!! [17:38:56] moritzm: I don’t want to ruin your evening but would appreciate a bit of your time yet this evening; want to work through a pdns concern I had when I tested on Friday [17:39:04] Unfortunately I have a 30 minute meeting right after this one :/ [17:47:03] andrewbogott: sure, we can do that later the evening [17:47:12] moritzm: thank you! [17:53:27] bleh. I can't ssh into search-datavis-experimental.shiny-r.eqiad.wmflabs (public key denied) even though I can ssh into search-datavis.shiny-r.eqiad.wmflabs & search-datavis-beta.shiny-r.eqiad.wmflabs no problem :\ I just created that instance so I don't know why it's acting that way. [17:53:35] (03PS1) 10Yuvipanda: lab_debrepo: Fix the sources file's contents [puppet] - 10https://gerrit.wikimedia.org/r/257371 [17:54:05] andrewbogott: ^ is what broke shinken-01 [17:54:12] bearloga: that interests me but I’m in meetings now. 
Can you ping me in an hour or so in #wikimedia-labs if it’s still not working? [17:54:24] bearloga: also (standard answer) check your project’s security groups [17:55:32] andrewbogott: thanks, will do. [17:55:58] (03CR) 10Yuvipanda: [C: 032] lab_debrepo: Fix the sources file's contents [puppet] - 10https://gerrit.wikimedia.org/r/257371 (owner: 10Yuvipanda) [17:58:11] (03PS1) 10Giuseppe Lavagetto: gdash: completely decommission all req* dashboards [puppet] - 10https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) [17:58:25] 6operations: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274#1858994 (10Milimetric) [17:58:57] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Pending creation of one grafana dashboard." [puppet] - 10https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [17:59:09] (03CR) 10Ori.livneh: "Is there any reason not to remove the vhost entirely?" [puppet] - 10https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [18:00:29] 6operations, 10Analytics, 10Traffic, 5Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1859016 (10Milimetric) [18:01:54] 7Blocked-on-Operations, 6operations, 10Traffic: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1859026 (10ori) Even more amazing: when the 'disableImages' cookie is unset, it is not actually cleared; instead it is simply set to a blank value. This means that changi... 
[18:03:18] <_joe_> ori: heh, I might agree :P [18:03:54] <_joe_> ori: I'll take a look at the access stats for it [18:03:58] I access it [18:04:01] so I inflate the stats [18:04:10] never totally unlearned the muscle memory for reqStats [18:04:19] (03PS2) 10EBernhardson: Set initial titlesuggest shard sizes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256974 [18:04:28] but the "Heads up! Site operations metric dashboards have been migrated to https://grafana.wikimedia.org/. Please update your bookmarks. " notice has been up since october 25 [18:04:31] (03CR) 10EBernhardson: Set initial titlesuggest shard sizes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256974 (owner: 10EBernhardson) [18:08:16] 6operations, 7Monitoring, 7Privacy, 7Security-Core: status.wikimedia.org should not load Google Analytics - https://phabricator.wikimedia.org/T115945#1859058 (10Milimetric) [18:10:24] 6operations, 7Mail: New Onboard's (Marc Brent) Email not working - https://phabricator.wikimedia.org/T120672#1859073 (10JKrauska) DNS issue -- resolved -- please attempt to resync? [18:12:21] akosiaris: can you try to hup ldap to force a resync? [18:12:38] akosiaris: and double check to make sure dns resolution is now working for you? [18:14:03] cajoel: ok I can see some answers for ldap1.corp.wikimedia.org, but I think I got a negative TTL [18:14:09] lemme flush the caches to make sure [18:14:16] and I 'll restart slapd afterwards [18:15:08] should return 198.73.209.23 [18:16:01] !log deleting pmtpa host groups from nagios/icinga [18:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:15] akosiaris: I see a connect from dubnium [18:25:04] cajoel: ok, so dubnium seems fine [18:25:16] I see mbrent in the tree just fine [18:25:21] I think we might be back in business! [18:25:22] thanks! [18:25:31] what was the problem cajoel? 
[18:25:35] cajoel: yw, thanks as well [18:25:44] ns1/ns2.corp were returning SERVFAIL when I tried before [18:26:17] cajoel: pollux is fine as well, so both our mirrors are fine now [18:26:19] simplified our dns setup, and those records were missed [18:26:24] fail [18:26:41] explains another sync concern I had late last week [18:26:54] thanks for the prompt responses [18:27:01] we'll get mbrent up and running [18:27:04] why SERVFAIL and not NXDOMAIN thought... [18:27:06] though* [18:28:02] paravoid: it wasn' [18:28:10] t being marked authorative [18:28:16] so double fail [18:28:20] oh heh [18:28:21] resolved both right now.. [18:29:04] 6operations, 10Analytics: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1859172 (10Milimetric) Can this be closed? Seems out of date. [18:29:06] insert cartman police uniform authority joke.. [18:29:17] (03CR) 10Giuseppe Lavagetto: "@ori: I just didn't assess if/how much is gdash being used at the moment. We really need to decommission requstats now because that will b" [puppet] - 10https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [18:29:33] (03CR) 10Ori.livneh: [C: 031] gdash: completely decommission all req* dashboards [puppet] - 10https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [18:29:50] 6operations, 10Analytics, 10Traffic: Fix annoying varnishncsa+initsystem issues on jessie - https://phabricator.wikimedia.org/T97351#1859176 (10Milimetric) Is this still valid? 
[18:32:39] 6operations, 10ops-eqiad: es1019 and its management interface are unresponsive - https://phabricator.wikimedia.org/T120689#1859187 (10jcrespo) 3NEW [18:33:12] 6operations, 10ops-eqiad: es1019 and its management interface are unresponsive - https://phabricator.wikimedia.org/T120689#1859199 (10jcrespo) [18:33:13] 6operations, 10DBA: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#1845648 (10jcrespo) [18:36:50] !log es1019 and its management interface became unresponsive [18:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:47] bblack: can we get a cache purge for https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json? [18:42:54] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1859258 (10coren) I've gotten the problem to reproduce once out of 17 attempts with POST stalling at `F/W Initializing Devices 0%`; which is the original issues. FW... [18:44:53] (03PS1) 10Faidon Liambotis: varnish: don't store hit-for-pass objects for logged-in users [puppet] - 10https://gerrit.wikimedia.org/r/257382 [18:46:55] bearND: echo "https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json" | mwscript purgeList.php --wiki=aawiki [18:46:59] That, in theory, should do it [18:47:23] Might want a few versions, like http:// and just // [18:49:26] (03CR) 10Paladox: "The redirections seem to work now in gerrit they redirect to correct callsign. Meaning with this patch it fixes it all I have tested it an" [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [18:50:08] Hey YuviPanda, where would a thumbnail error for betacommons end up? [18:51:02] marktraceur: no idea :) ask in #wikimedia-releng? 
[18:51:04] (or whomever) [18:51:07] Oh, hm [18:51:15] I guess I thought "labs" so you were my first stop [18:51:27] awww [18:52:32] YuviPanda: print that and frame it :) [18:52:42] Reedy: Thanks. We only need https since the Android app only uses this specific URL. Not sure where I should run this command. I'm not an ops person so I probably don't have the permission to do so. [18:52:57] bearND: Can you deploy code? [18:53:00] ie access to tin? [18:53:08] (CR) Andrew Bogott: "one bikeshed" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: coren) [18:53:51] Reedy: I have access to tin, but I've only deployed a RESTBase service. I don't have experience deploying MW services [18:54:16] bearND: literally just ssh onto tin, then paste what I put, press enter [18:54:20] Reedy, I don't think that purge you gave will work [18:54:28] Why not? [18:54:29] It's /static/, which gets normalised to www.wikimedia.org in varnish [18:54:30] * YuviPanda prints and frames marktraceur [18:54:35] (CR) coren: Add a new security module with ::pam and ::access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: coren) [18:54:52] So rewrite the url and purge that instead? [18:55:16] oh wow, bearND has deployment access without ever having deployed MW stuff? [18:55:20] Ok, I'll give it a shot [18:56:18] oh wait. What does the --wiki parameter do?
[18:56:33] runs purgeList.php in the context of aawiki [18:56:56] Technically it's no longer necessary [18:57:03] operations, DBA: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#1859349 (jcrespo) [18:57:05] you can omit it and multiversion will default to aawiki [18:57:05] (CR) Andrew Bogott: [C: 1] "bikeshed withdrawn" [puppet] - https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: coren) [18:57:06] operations, DBA: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1859348 (jcrespo) [18:57:17] this only works with specific maintenance scripts like purgeList [18:57:51] (PS1) Ori.livneh: Prevent Apache from setting TCP_DEFER_ACCEPT by default [puppet] - https://gerrit.wikimedia.org/r/257388 (https://phabricator.wikimedia.org/T119372) [18:58:03] Blocked-on-Operations, operations, Traffic, Patch-For-Review: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1859353 (dr0ptp4kt) Gents, just acknowledging I saw this task. I'll tag this with #reading-admin. @jkatzwmf, @jdlrobson is this eligible for the cu... [18:58:10] That's not new [18:58:10] It just didn't always work [18:58:12] Blocked-on-Operations, operations, Reading-Admin, Traffic, Patch-For-Review: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1859356 (dr0ptp4kt) [19:00:32] (CR) Ori.livneh: [C: 2] Prevent Apache from setting TCP_DEFER_ACCEPT by default [puppet] - https://gerrit.wikimedia.org/r/257388 (https://phabricator.wikimedia.org/T119372) (owner: Ori.livneh) [19:00:35] Reedy: Krenair : Great, that worked. Thank you for your help.
Cache is purged /cc: dr0ptp4kt_ mdholloway [19:00:45] echo "https://www.wikimedia.org/static/current/extensions/MobileApp/config/android.json" | mwscript purgeList.php --wiki=aawiki [19:00:49] Krenair: See. I was right :) [19:01:02] :P [19:01:03] right about the hostname? [19:01:13] bearND: Make a note of it somewhere for next time [19:01:13] It working [19:01:13] Reedy: I had to use www instead of meta [19:01:28] Easy enough fixed I guess [19:01:37] So I was right then :P [19:02:02] Reedy: done. Thanks. Yes we'll need that a couple more times in the near term future when we increase the roll out percentage [19:02:11] Reedy, templates/varnish/text-frontend.inc.vcl.erb: if (req.url ~ "^/static/") { set req.http.host = "<%= @vcl_config.fetch("static_host") %>"; } [19:09:18] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 3 failures [19:09:30] PROBLEM - puppet last run on mc2010 is CRITICAL: CRITICAL: Puppet has 1 failures [19:09:58] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 3 failures [19:10:05] grr [19:10:08] apache on puppetmaster [19:10:20] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 1 failures [19:10:28] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 2 failures [19:10:29] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures [19:10:40] these are ephemeral [19:10:59] PROBLEM - puppet last run on mw2148 is CRITICAL: CRITICAL: Puppet has 1 failures [19:11:09] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Puppet has 1 failures [19:11:19] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:11:29] !log restarting ldap on neptunium and nembus due to replication warnings in the logs [19:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:39] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Puppet has 3 
failures [19:11:58] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: Puppet has 1 failures [19:12:02] andrewbogott: i assume you got those because the ldap corp mirror was gone in between [19:12:20] PROBLEM - puppet last run on mw2063 is CRITICAL: CRITICAL: Puppet has 1 failures [19:12:29] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Puppet has 1 failures [19:12:30] or i might be totally wrong, nevermind [19:12:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [19:12:40] mutante: I don’t know what you’re talking about :( [19:12:43] you mean those puppet last run errors? [19:13:05] no, the puppet last run errors were caused by my change. they are ephemeral; the run failed for those hosts that were mid-run when apache2 on palladium reloaded. [19:13:30] they'll be fine on the next run [19:16:28] 7Blocked-on-Operations, 6operations, 6Reading-Admin, 10Traffic, 5Patch-For-Review: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1859424 (10ori) I think the disableImages feature is a misfeature, at the end of the day. (a) It never worked properly. (b) Every... 
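[editor's note] The purge discussed earlier only took effect once the URL used www.wikimedia.org, because (per the VCL line quoted in the channel) Varnish rewrites the Host header for any request whose URL starts with /static/. What follows is a minimal Python sketch of that normalisation, not the production logic: the function names are hypothetical, and the value of static_host is assumed to be www.wikimedia.org based on what was said in the conversation.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: this is what vcl_config's "static_host" resolves to,
# going by the discussion in the channel.
STATIC_HOST = "www.wikimedia.org"


def purge_target(url, static_host=STATIC_HOST):
    """Return the URL as the cache would key it.

    Mirrors the quoted VCL rule: if the request URL starts with
    /static/, the host is replaced by the static host.
    """
    parts = urlsplit(url)
    if parts.path.startswith("/static/"):
        parts = parts._replace(netloc=static_host)
    return urlunsplit(parts)


def purge_variants(url):
    """Return https, http and protocol-relative forms of the purge target."""
    target = urlsplit(purge_target(url))
    return [urlunsplit(target._replace(scheme=s)) for s in ("https", "http", "")]


if __name__ == "__main__":
    original = ("https://meta.wikimedia.org/static/current/extensions/"
                "MobileApp/config/android.json")
    # One URL per line, the format purgeList.php reads from stdin.
    print("\n".join(purge_variants(original)))
```

The printed list could be piped into the same mwscript purgeList.php invocation shown in the channel; the three variants follow the suggestion to also purge the http:// and // forms.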
[19:20:25] !log restarted pdns on labcontrol1001 and 1002 to catch up with recent opendj restarts [19:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:09] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [19:21:40] (CR) Ottomata: [C: 1] gdash: completely decommission all req* dashboards [puppet] - https://gerrit.wikimedia.org/r/257373 (https://phabricator.wikimedia.org/T118979) (owner: Giuseppe Lavagetto) [19:26:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [19:27:42] operations, Mail: New Onboard's (Marc Brent) Email not working - https://phabricator.wikimedia.org/T120672#1859474 (akosiaris) Open>Resolved a:akosiaris Resync done, uid has been imported, seems like resolved. Closing this [19:31:03] (CR) 20after4: [C: -1] "this is not the solution. The redirection explicitly depends on the semicolons as delimiters. We can modify the phabricator side of this " [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox) [19:31:31] (CR) Dzahn: [C: -1] "i asked in -devtools about this and there is a redirector in phabricator handlings this. and it relies on the ";" part, so this should sta" [puppet] - https://gerrit.wikimedia.org/r/257193 (owner: Paladox) [19:32:11] (PS1) Bartosz Dziewoński: gerrit: Map /tools/hooks/commit-msg to /r/tools/hooks/commit-msg [puppet] - https://gerrit.wikimedia.org/r/257396 [19:33:31] akosiaris: errgggh maybe not. [19:33:39] you are probably gone for the day, eh?
[19:33:50] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:34:12] (03CR) 10Chad: [C: 031] gerrit: Map /tools/hooks/commit-msg to /r/tools/hooks/commit-msg [puppet] - 10https://gerrit.wikimedia.org/r/257396 (owner: 10Bartosz Dziewoński) [19:34:40] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:35:18] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:36:53] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:36:53] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:36:53] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:53] (03PS2) 10Bartosz Dziewoński: gerrit: Map /tools/hooks/commit-msg to /r/tools/hooks/commit-msg [puppet] - 10https://gerrit.wikimedia.org/r/257396 [19:36:54] (03PS1) 10Matěj Suchánek: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257397 [19:36:54] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:54] RECOVERY - puppet last run on mw2063 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [19:36:54] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:54] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:37:29] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:37:50] (03PS1) 10Papaul: Add DNS entries for auth2001 Bug:T120263 [dns] - 10https://gerrit.wikimedia.org/r/257398 
(https://phabricator.wikimedia.org/T120263) [19:38:31] Blocked-on-Operations, operations, Reading-Admin, Traffic, Patch-For-Review: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1859513 (JKatzWMF) @dr0ptp4kt I think this needs to wait at least until next sprint [19:38:59] hey folks, who do i bribe to deploy https://gerrit.wikimedia.org/r/257396 ? or do i have to sign up for puppetswat with it? it'd be nice to have it done quickly, since all the GCI students are going to be running into this. (at least one did already) [19:39:56] Blocked-on-Operations, operations, Reading-Admin, Traffic, Patch-For-Review: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1859528 (JKatzWMF) Ugh, ignore ^. In a rush and thought this was something else. Do we know how many users this impacts? [19:40:14] (CR) Alex Monk: "not mentioned on https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions yet" [dns] - https://gerrit.wikimedia.org/r/257398 (https://phabricator.wikimedia.org/T120263) (owner: Papaul) [19:47:35] Blocked-on-Operations, operations, Reading-Admin, Traffic, Patch-For-Review: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1859585 (Tnegrin) If people turn it on, then off, will we be able to count them?
[19:47:47] operations, ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1859586 (Papaul) [19:49:20] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [19:49:28] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 14.1560366667 [19:50:13] (PS1) Jcrespo: Repool es1015 at 100% load [mediawiki-config] - https://gerrit.wikimedia.org/r/257402 [19:51:44] (CR) Jcrespo: [C: 2] Repool es1015 at 100% load [mediawiki-config] - https://gerrit.wikimedia.org/r/257402 (owner: Jcrespo) [19:52:52] !log restarting nova-compute on all labvirt1X nodes [19:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:57] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1015 at 100% load (duration: 00m 27s) [19:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:33] (PS12) coren: Add a new security module with ::pam and ::access [puppet] - https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) [19:54:34] no, really, if anyone deploys https://gerrit.wikimedia.org/r/257396 for me, i'll set you up with delicious Polish candy the next time we meet.
[19:56:16] (03CR) 10coren: [C: 032] "This should be a noop for all but two projects (tools, deployment-prep); the latter two should get functionally identical but slightly dif" [puppet] - 10https://gerrit.wikimedia.org/r/256693 (https://phabricator.wikimedia.org/T120106) (owner: 10coren) [19:59:44] (03PS1) 10Mobrovac: AQS: Configure Cassandra for AQS in BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/257406 (https://phabricator.wikimedia.org/T116206) [20:00:58] (03PS2) 10Mobrovac: AQS: Configure Cassandra for AQS in BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/257406 (https://phabricator.wikimedia.org/T116206) [20:01:29] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 3.33353151515 [20:05:45] 6operations, 10ops-codfw: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#1859661 (10Papaul) This System is out of warranty. [20:06:24] (03PS1) 10GWicke: Update RESTBase configs for RESTBase v0.9.1 [puppet] - 10https://gerrit.wikimedia.org/r/257408 [20:06:56] 6operations: Check incoming requests to secure.wm.o - https://phabricator.wikimedia.org/T119274#1859674 (10Dzahn) everything left appears to be from these 3: internal wiki of northgrum.com internal wiki of alcatel-lucent.com wellwiki.org [20:07:14] 6operations, 10ops-codfw: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#1859675 (10jcrespo) Yes, we will arrange with Chris sending you some disks soon. [20:07:26] !log disabling puppet on labcontrol1002 for an ldap/pdns test [20:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:07] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1859678 (10RobH) Well, we have @Mark's approval to use the on-site spares for this (as we have plenty.) Now the only question is how b... 
[20:08:45] SMalyshev: I'm not sure which is the best answer on the task I've updated above ^ but i'd like more input [20:08:50] the disks are approved though =] [20:10:04] robh: I think it's better to add the disks to raid1 for now and make a task to reimage them later... I don't want to do it right now when everybody is busy and/or on vacation [20:10:09] operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1859697 (RobH) My issue is if we toss these in as a raid1 and extend the LVM, that is NOT how we would typically install it from scra... [20:10:16] SMalyshev: ahh [20:10:21] cool, that makes sense to me [20:10:34] my comment just now was before i saw your reply (we both replied at same time) [20:10:43] so if you can feedback with that on task, i think we can move ahead like that. [20:10:52] with a stalled reimage task for a later date seems sane to me [20:10:56] somewhere in Jan we can take our time to shut them down one by one and reinstall [20:11:06] cool [20:11:47] operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1859716 (Smalyshev) I think we can add the disks now to raid1 and file a task to reimage them somewhere in January. [20:12:38] SMalyshev: so i dont think adding in the disks should cause them to halt or stop working, but im never 100% certain [20:12:39] ever [20:12:51] so of those two, which is ideal to test this on first? [20:13:03] robh: doesn't matter, they are supposed to be identical [20:13:13] so either one is fine [20:13:18] yea but if we toss it in and it freezes the system, neither is a master?
[20:13:19] cool [20:13:25] chris will ahve ssh open and watching it when he does [20:13:30] but one can never be certain [20:13:48] so approved and now i'll make the tasks to make it happen [20:14:10] (PS1) Muehlenhoff: Add associatedDomain to LDAP index for labs [puppet] - https://gerrit.wikimedia.org/r/257409 [20:14:36] operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1859729 (RobH) That sounds reasonable to me. I'll create the sub-tasks for the on-site installation of the disks and implementation... [20:15:22] (CR) Andrew Bogott: [C: 1] "Tabs are slightly misaligned but I don't care :)" [puppet] - https://gerrit.wikimedia.org/r/257409 (owner: Muehlenhoff) [20:16:52] (PS2) Muehlenhoff: Add associatedDomain to LDAP index for labs [puppet] - https://gerrit.wikimedia.org/r/257409 [20:17:19] (CR) Muehlenhoff: [C: 2 V: 2] Add associatedDomain to LDAP index for labs [puppet] - https://gerrit.wikimedia.org/r/257409 (owner: Muehlenhoff) [20:20:15] operations, ops-eqiad, Discovery, Wikidata, Wikidata-Query-Service: install two Intel 320 Series SSDSA2CW300G3 2.5" 300GB each in wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T120712#1859742 (RobH) NEW a:Cmjohnson [20:20:33] operations, Discovery, Wikidata, Wikidata-Query-Service, hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1859752 (RobH) Open>Resolved Resolving this task as the onsite task now exists.
[20:24:10] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: implement wdqs1001/1002 disk upgrades (extend lvm) - https://phabricator.wikimedia.org/T120714#1859769 (10RobH) 3NEW [20:24:33] (03PS1) 10coren: Labs: build images without pam.d and access.conf hacks [puppet] - 10https://gerrit.wikimedia.org/r/257411 (https://phabricator.wikimedia.org/T120710) [20:25:14] (03PS1) 10Muehlenhoff: Also add aRecord to LDAP indices [puppet] - 10https://gerrit.wikimedia.org/r/257412 [20:25:32] (03PS3) 10Bartosz Dziewoński: gerrit: Map /tools/hooks/commit-msg to /r/tools/hooks/commit-msg [puppet] - 10https://gerrit.wikimedia.org/r/257396 [20:26:04] (03PS2) 10Andrew Bogott: Also add aRecord to LDAP indices [puppet] - 10https://gerrit.wikimedia.org/r/257412 (owner: 10Muehlenhoff) [20:26:12] (03CR) 10Andrew Bogott: [C: 031] Also add aRecord to LDAP indices [puppet] - 10https://gerrit.wikimedia.org/r/257412 (owner: 10Muehlenhoff) [20:26:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also add aRecord to LDAP indices [puppet] - 10https://gerrit.wikimedia.org/r/257412 (owner: 10Muehlenhoff) [20:32:32] 6operations, 6Labs, 10Labs-Infrastructure: logrotate/disk space on silver for nutcracker log - https://phabricator.wikimedia.org/T120683#1859807 (10chasemp) p:5Triage>3High There is a logrotate there now I guess we need to get more aggressive on this box: silver:~# cat /etc/logrotate.d/nutcracker /var/l... 
[20:32:51] (03PS1) 10Ori.livneh: memcached: /a/redis -> /srv/redis [puppet] - 10https://gerrit.wikimedia.org/r/257415 [20:33:06] (03PS2) 10Ori.livneh: memcached: /a/redis -> /srv/redis [puppet] - 10https://gerrit.wikimedia.org/r/257415 [20:33:15] (03CR) 10Ori.livneh: [C: 032 V: 032] memcached: /a/redis -> /srv/redis [puppet] - 10https://gerrit.wikimedia.org/r/257415 (owner: 10Ori.livneh) [20:33:35] !log enabling puppet on labcontrol1002, testing over [20:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:33:56] 6operations, 6Labs, 10Labs-Infrastructure, 7HTTPS: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#1859817 (10chasemp) p:5Triage>3Normal [20:40:40] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: Puppet has 1 failures [20:42:44] (03PS35) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:44:11] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [20:45:49] (03PS36) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:45:58] PROBLEM - Host ores.wmflabs.org is DOWN: check_ping: Invalid hostname/address - ores.wmflabs.org [20:46:28] (03PS37) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:47:09] (03PS1) 10Ori.livneh: migrate mc* to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257417 [20:48:09] (03CR) 10Thcipriani: [C: 031] "Cherry picked to `deployment-puppetmaster` works as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/257406 (https://phabricator.wikimedia.org/T116206) (owner: 10Mobrovac) [20:49:50] (03PS38) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:51:29] RECOVERY - Host ores.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 3.29 ms [20:51:44] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [20:53:04] (03PS39) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:59:01] (03PS40) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd in role::eventbus [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:59:42] (03PS2) 10Ori.livneh: migrate mc* to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257417 [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151207T2100). Please do the needful. [21:02:34] (03CR) 10Ori.livneh: [C: 032] migrate mc* to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257417 (owner: 10Ori.livneh) [21:03:04] (03CR) 10Eevans: "This is ready for review; I'd like to work on getting the extension setup in deployment-prep!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/256418 (https://phabricator.wikimedia.org/T116786) (owner: 10Eevans) [21:03:13] getting ready to deploy parsoid [21:04:39] (03PS1) 10coren: Add stable.toolserver.org to legacy redirects [puppet] - 10https://gerrit.wikimedia.org/r/257425 (https://phabricator.wikimedia.org/T120526) [21:05:12] !log starting parsoid deploy [21:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:22] greg-g: Hi... I would like to deploy the fix for https://phabricator.wikimedia.org/T120716 sometime soon [21:09:00] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:09:43] !log restart parsoid on wtp1002 as a canary [21:09:46] hoo: before swat? [21:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:09:55] greg-g: Preferably [21:10:11] I need to get up tomorrow and would like to watch the dump creation for a bit [21:10:22] also late SWAT is after the RDF dump creation started [21:10:28] and I don't want that failing as well [21:10:48] The code touched is only being run in the maint. 
script, it's not hit during web requests or anything [21:11:14] (PS1) Ori.livneh: Fix-up for I0f55f28364: include {append,db}filename settings [puppet] - https://gerrit.wikimedia.org/r/257428 [21:11:30] (CR) Ori.livneh: [C: 2 V: 2] Fix-up for I0f55f28364: include {append,db}filename settings [puppet] - https://gerrit.wikimedia.org/r/257428 (owner: Ori.livneh) [21:12:38] restarting parsoid on all nodes [21:13:43] hoo: right right, doit at will (things are clear now other than parsoid) [21:15:09] !finished deploying parsoid sha 4a7df427 (cf0b9ef + cherry-pick of d65debd) [21:15:29] greg-g: Thanks :) [21:15:43] !log finished deploying parsoid sha 4a7df427 (cf0b9ef + cherry-pick of d65debd) [21:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:46] (PS1) Ori.livneh: Fix-up for I0f55f28364: omit settings for AOF; not enabled [puppet] - https://gerrit.wikimedia.org/r/257430 [21:16:58] (CR) Ori.livneh: [C: 2 V: 2] Fix-up for I0f55f28364: omit settings for AOF; not enabled [puppet] - https://gerrit.wikimedia.org/r/257430 (owner: Ori.livneh) [21:16:58] subbu: Are you done? [21:17:07] hoo, yes.
[21:17:12] nice [21:19:43] (03PS1) 10Jdlrobson: Replicate Wikivoyage RelatedArticles behaviour on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257431 (https://phabricator.wikimedia.org/T116676) [21:20:20] (03PS2) 10Dzahn: Add mgmt DNS entries for auth2001 Bug:T120263 [dns] - 10https://gerrit.wikimedia.org/r/257398 (https://phabricator.wikimedia.org/T120263) (owner: 10Papaul) [21:31:58] !log hoo@tin Synchronized php-1.27.0-wmf.7/extensions/Wikidata/: (no message) (duration: 00m 37s) [21:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:32:58] PROBLEM - Redis on mc2001 is CRITICAL: Connection refused [21:33:05] mc2001 is me [21:33:07] PROBLEM - Redis on mc2005 is CRITICAL: Connection refused [21:33:17] PROBLEM - Redis on mc2012 is CRITICAL: Connection refused [21:33:36] PROBLEM - Redis on mc2010 is CRITICAL: Connection refused [21:34:37] PROBLEM - Redis on mc2009 is CRITICAL: Connection refused [21:35:37] PROBLEM - Redis on mc2013 is CRITICAL: Connection refused [21:35:47] PROBLEM - Redis on mc2014 is CRITICAL: Connection refused [21:35:56] that doesn't sound healthy [21:35:57] PROBLEM - Redis on mc2004 is CRITICAL: Connection refused [21:35:57] PROBLEM - Redis on mc2007 is CRITICAL: Connection refused [21:35:57] PROBLEM - Redis on mc2016 is CRITICAL: Connection refused [21:36:10] oh wait, it's codfw [21:36:27] PROBLEM - Redis on mc2008 is CRITICAL: Connection refused [21:36:30] (03PS1) 10Jdlrobson: Enable Cards and RelatedArticles so it rides the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257434 [21:36:32] (03PS1) 10Jdlrobson: Enable RelatedArticles on all wikipedias in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257435 (https://phabricator.wikimedia.org/T116676) [21:36:57] ori: ^ [21:37:01] ori: is that from the migration? 
[21:37:14] yeah, i'm on top of it [21:37:37] PROBLEM - Redis on mc2011 is CRITICAL: Connection refused [21:37:37] PROBLEM - Redis on mc2006 is CRITICAL: Connection refused [21:37:37] PROBLEM - Redis on mc2015 is CRITICAL: Connection refused [21:37:49] ori: kk [21:38:07] PROBLEM - Redis on mc2003 is CRITICAL: Connection refused [21:39:54] (03PS2) 10BryanDavis: Replicate Wikivoyage RelatedArticles behaviour on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257431 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [21:40:11] (03CR) 10BryanDavis: [C: 032] "Beta only; I'll pull to tin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257431 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [21:40:38] (03Merged) 10jenkins-bot: Replicate Wikivoyage RelatedArticles behaviour on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257431 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [21:41:18] RECOVERY - Redis on mc2010 is OK: TCP OK - 0.042 second response time on port 6379 [21:41:36] RECOVERY - Redis on mc2011 is OK: TCP OK - 0.037 second response time on port 6379 [21:41:36] RECOVERY - Redis on mc2013 is OK: TCP OK - 0.036 second response time on port 6379 [21:41:38] RECOVERY - Redis on mc2014 is OK: TCP OK - 0.037 second response time on port 6379 [21:41:47] RECOVERY - Redis on mc2004 is OK: TCP OK - 0.036 second response time on port 6379 [21:41:48] RECOVERY - Redis on mc2007 is OK: TCP OK - 0.039 second response time on port 6379 [21:41:48] RECOVERY - Redis on mc2016 is OK: TCP OK - 0.036 second response time on port 6379 [21:42:07] RECOVERY - Redis on mc2003 is OK: TCP OK - 0.038 second response time on port 6379 [21:42:18] RECOVERY - Redis on mc2008 is OK: TCP OK - 0.038 second response time on port 6379 [21:42:36] RECOVERY - Redis on mc2009 is OK: TCP OK - 0.042 second response time on port 6379 [21:42:47] RECOVERY - Redis on mc2001 is OK: TCP OK - 0.037 second response time on port 6379 
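[editor's note] The Redis alerts in this stretch read like plain TCP connect checks against port 6379: "Connection refused" while nothing is listening, then "TCP OK" with a response time once the instances are back. A minimal Python sketch of such a check follows, assuming that behaviour. The check_tcp function is hypothetical and only imitates the message format seen in the log; it is not the monitoring plugin actually in use.

```python
import socket
import time


def check_tcp(host, port, timeout=3.0):
    """Try a TCP connect and return a status line in the style of the log.

    Hypothetical helper: it reports OK with the connect time when the
    port accepts a connection, and CRITICAL otherwise.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except ConnectionRefusedError:
        # Nothing is listening on the port, e.g. the service is stopped.
        return "CRITICAL: Connection refused"
    except OSError as exc:
        # Timeouts, unreachable hosts, name resolution failures, etc.
        return f"CRITICAL: {exc}"
    return f"TCP OK - {elapsed:.3f} second response time on port {port}"


if __name__ == "__main__":
    # Assumed target: a local Redis on its default port.
    print(check_tcp("127.0.0.1", 6379))
```

A connect-only check says nothing about whether Redis answers commands, which is one reason a migration like the one in progress here can flip many hosts to CRITICAL and back within minutes.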
[21:42:48] RECOVERY - Redis on mc2005 is OK: TCP OK - 0.037 second response time on port 6379 [21:43:06] RECOVERY - Redis on mc2012 is OK: TCP OK - 0.038 second response time on port 6379 [21:43:28] RECOVERY - Redis on mc2006 is OK: TCP OK - 0.041 second response time on port 6379 [21:43:36] RECOVERY - Redis on mc2015 is OK: TCP OK - 0.036 second response time on port 6379 [21:45:20] (03PS1) 10Krinkle: Remove profiler config variables that no longer exist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257437 [21:49:12] (03CR) 10Dzahn: [C: 032] Add mgmt DNS entries for auth2001 Bug:T120263 [dns] - 10https://gerrit.wikimedia.org/r/257398 (https://phabricator.wikimedia.org/T120263) (owner: 10Papaul) [21:51:33] (03PS2) 10Jdlrobson: Enable RelatedArticles on all wikipedias in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257435 (https://phabricator.wikimedia.org/T116676) [21:51:35] (03PS2) 10Jdlrobson: Enable Cards and RelatedArticles so it rides the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257434 [21:51:37] (03PS1) 10Jdlrobson: Allow disabling of side bar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257440 [21:53:34] (03PS3) 10Jdlrobson: Enable RelatedArticles on all wikipedias in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257435 (https://phabricator.wikimedia.org/T116676) [21:53:36] (03PS3) 10Jdlrobson: Enable Cards and RelatedArticles so it rides the train [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257434 [21:53:38] (03PS2) 10Jdlrobson: Allow disabling of side bar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257440 [21:54:12] (03CR) 10BryanDavis: [C: 032] Allow disabling of side bar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257440 (owner: 10Jdlrobson) [21:54:47] (03Merged) 10jenkins-bot: Allow disabling of side bar [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257440 (owner: 10Jdlrobson) [22:00:46] PROBLEM - Unmerged changes on repository mediawiki_config on mira 
is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [22:01:02] (03CR) 10Legoktm: Allow disabling of side bar (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257440 (owner: 10Jdlrobson) [22:01:53] !log mobileapps deployed 984f14b [22:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:14] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1860065 (10Papaul) @Robh or @MoritzMuehlenhoff I discussed this with Daniel on IRC. Since this server has a new name that does not exist on the wiki page, the wiki page needs to be update with the new name. Since... [22:03:07] (03CR) 10BryanDavis: Allow disabling of side bar (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257440 (owner: 10Jdlrobson) [22:03:56] (03CR) 10BBlack: [C: 04-1] "Right idea, but we probably need to sort out a few other details (notably, we don't want to pass on mobile requests with just disableImage" [puppet] - 10https://gerrit.wikimedia.org/r/257382 (owner: 10Faidon Liambotis) [22:04:35] bd808: I don't get it...why would changing InitializeSettings cause a race? [22:06:36] (03PS1) 10EBernhardson: Update portals submodule to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257445 [22:10:05] 6operations, 10ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1860079 (10Papaul) @RobH or @MoritzMuehlenhoff I update the wiki page if you need to change anything please do. Thanks https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [22:11:14] legoktm: so the way I have seen it happen before... new $wmgX used in CommonSettings, new $wmgX introduced in InitializeSettings, scap, MW server sees change to InitializeSettings on local disk before change to CommonSettings, boom! [22:13:29] bd808: okay, but it's not new in this change right? [22:13:53] right. it's overkill. 
[22:14:12] (PS2) Dzahn: elasticsearch: move role to module/role [puppet] - https://gerrit.wikimedia.org/r/257036
[22:14:36] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/1454/" [puppet] - https://gerrit.wikimedia.org/r/257036 (owner: Dzahn)
[22:16:23] operations, ops-codfw: rack new yubico auth system - https://phabricator.wikimedia.org/T120263#1860126 (RobH) @papaul: looks good, that is all it really needs to have on that page for now. Thanks (both for noticing its lack and adding it!)
[22:19:03] (CR) Ori.livneh: "Thanks Filippo!" (5 comments) [debs/bloomd] - https://gerrit.wikimedia.org/r/257168 (owner: Ori.livneh)
[22:20:05] (PS2) Ori.livneh: import debian directory [debs/bloomd] - https://gerrit.wikimedia.org/r/257168
[22:20:09] (CR) EBernhardson: [C: 2] Update portals submodule to master [mediawiki-config] - https://gerrit.wikimedia.org/r/257445 (owner: EBernhardson)
[22:20:39] (Merged) jenkins-bot: Update portals submodule to master [mediawiki-config] - https://gerrit.wikimedia.org/r/257445 (owner: EBernhardson)
[22:21:35] (PS1) Dereckson: Fix typo in namespaces configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/257487
[22:22:41] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[22:22:56] !log ebernhardson@tin Synchronized portals/prod: Deploy updated portals (duration: 00m 27s)
[22:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:23:11] ebernhardson: just being curious, what's the path to that submodule
[22:23:49] tried mediawiki-config/portals
[22:24:06] mutante: thats the one
[22:24:39] mutante: oh in gerrit? wikimedia/portals
[22:25:25] ebernhardson: yes, in gerrit, i guess it confused me that it's a submodule of mediawiki-config but not actually mediawiki-config/portals so i couldnt find it. thanks!
[22:26:36] (PS3) Ori.livneh: setup EventBus extension for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/256418 (https://phabricator.wikimedia.org/T116786) (owner: Eevans)
[22:27:15] (CR) Ori.livneh: [C: 2] setup EventBus extension for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/256418 (https://phabricator.wikimedia.org/T116786) (owner: Eevans)
[22:27:37] (Merged) jenkins-bot: setup EventBus extension for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/256418 (https://phabricator.wikimedia.org/T116786) (owner: Eevans)
[22:29:15] !log ori@tin Synchronized php-1.27.0-wmf.7/extensions/MobileFrontend: I8c94425693: Improve disableImages cookie code (T120151) (duration: 00m 30s)
[22:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:35:24] (CR) Dzahn: "@ottomata yes, class names are the same. compiler link shows no diff" [puppet] - https://gerrit.wikimedia.org/r/257035 (owner: Dzahn)
[22:36:29] (PS3) Dzahn: zookeeper: move roles to module/role [puppet] - https://gerrit.wikimedia.org/r/257035
[22:36:38] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860250 (madhuvishy) NEW
[22:37:00] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/1440/" [puppet] - https://gerrit.wikimedia.org/r/257035 (owner: Dzahn)
[22:37:23] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860262 (madhuvishy)
[22:38:09] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860265 (yuvipanda) Have you performed the required human and/or animal sacrifices?
[22:40:28] (PS1) Ori.livneh: Improve handling of disableImages cookie [puppet] - https://gerrit.wikimedia.org/r/257491
[22:40:38] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860272 (madhuvishy) @kevinator Can you approve?
[22:43:32] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860250 (kevinator) @yuvipanda how about a "world burns" token as a sacrifice?
[22:43:58] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860284 (kevinator) approved for @madhuvishy to get access.
[22:44:37] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860287 (yuvipanda) @kevinator only if you did it with a 7 business day wait period while standing on one foot with your eyes closed (this is much harder than standing on one foot with your...
[22:45:57] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860297 (Deskana) >>! In T120731#1860265, @yuvipanda wrote: > Have you performed the required human and/or animal sacrifices? The process for access requests has taken a strange turn, late...
[22:48:13] operations, hardware-requests: determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1860309 (RobH) The spares page has long been a point of pain. It is difficult to track more than a simple list on wiki, even with the VE improvements to wikitables. We already have a s...
[22:49:07] (PS2) Ori.livneh: Improve handling of disableImages cookie [puppet] - https://gerrit.wikimedia.org/r/257491 (https://phabricator.wikimedia.org/T120151)
[22:52:47] (PS1) Ori.livneh: Split the mobile cache for clients with 'NetSpeed=B' cookie [puppet] - https://gerrit.wikimedia.org/r/257496 (https://phabricator.wikimedia.org/T119798)
[23:00:00] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[23:01:31] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[23:02:02] mutante: that's yours ^
[23:02:17] Dzahn: zookeeper: move roles to module/role (2105f7217c)
[23:02:48] ori: sorry, merging now
[23:03:02] done
[23:03:30] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[23:04:01] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[23:06:18] (CR) Florianschmidtwelzow: [C: 1] Fix typo in namespaces configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/257487 (owner: Dereckson)
[23:09:27] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860468 (RobH) Access to 'eventlogging-roots' is a sudo request. It will require that this access-request be approved during our operations meetings. As this went in on Monday 2015-12-07;...
[23:10:05] Ops-Access-Requests, operations: eventlog1001 access for user Madhuvishy - https://phabricator.wikimedia.org/T120731#1860469 (madhuvishy) @RobH yes I understand, no problem.
[23:25:57] operations, hardware-requests: migrate spares into google sheet tracking & determine which eqiad spares to decommission - https://phabricator.wikimedia.org/T120679#1860565 (RobH)
[23:38:17] (CR) Jforrester: [C: 1] "Now ready." [mediawiki-config] - https://gerrit.wikimedia.org/r/251523 (https://phabricator.wikimedia.org/T117991) (owner: Jforrester)
[23:43:44] (PS9) Thcipriani: RESTBase configuration for scap3 deployment [puppet] - https://gerrit.wikimedia.org/r/252887
[23:43:46] (PS3) Thcipriani: AQS: Configure Cassandra for AQS in BetaCluster [puppet] - https://gerrit.wikimedia.org/r/257406 (https://phabricator.wikimedia.org/T116206) (owner: Mobrovac)
[23:48:09] (PS2) Ori.livneh: Improve handling of mobile variant cookies [puppet] - https://gerrit.wikimedia.org/r/257496 (https://phabricator.wikimedia.org/T119798)
[23:48:59] bblack: ^
[23:58:21] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 18.18% of data above the critical threshold [100000000.0]