[00:29:28] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3035_v6, cp3038_v6
[00:31:31] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[01:19:18] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 21 not-conn: cp3017_v6, cp4012_v6, cp4020_v6
[01:21:10] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[01:31:31] !log mwscript deleteEqualMessages.php --wiki sqwiki
[01:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:34:39] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp3045_v6
[01:35:50] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (103515s 100000s)
[01:38:39] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[01:46:39] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[01:50:39] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[02:15:08] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[02:17:08] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[02:26:05] !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 06m 59s)
[02:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:49] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-14 02:29:48+00:00
[02:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:38:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:47:00] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:48:09] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4012_v6
[02:50:10] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[03:02:06] (PS1) Tim Landscheidt: Tools: Migrate from labsdebrepo to aptly [puppet] - https://gerrit.wikimedia.org/r/238089
[03:05:58] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:09:55] (CR) Tim Landscheidt: "Tested the pinning on a Precise instance in Toolsbeta (after setting up toolsbeta-webproxy-01 as an aptly server):" [puppet] - https://gerrit.wikimedia.org/r/238089 (owner: Tim Landscheidt)
[03:28:00] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[04:04:38] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:09:28] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=737.91 Read Requests/Sec=4338.45 Write Requests/Sec=593.37 KBytes Read/Sec=23464.04 KBytes_Written/Sec=2373.47
[04:11:20] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.40 Read Requests/Sec=0.00 Write Requests/Sec=1.60 KBytes Read/Sec=0.00 KBytes_Written/Sec=6.40
[04:30:49] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (-6457 100000s)
[04:34:27] the wikitech-static check seems to fail and recover regularly. does the threshold just need to be put up a bit?
[04:47:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 14 04:47:58 UTC 2015 (duration 47m 57s)
[04:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:02:52] (PS3) Ori.livneh: mediawiki: kill HHVM graphite checks [puppet] - https://gerrit.wikimedia.org/r/237998 (owner: Faidon Liambotis)
[05:03:06] (PS4) Ori.livneh: mediawiki: kill HHVM graphite checks [puppet] - https://gerrit.wikimedia.org/r/237998 (owner: Faidon Liambotis)
[05:03:27] (CR) Ori.livneh: [C: 2 V: 2] mediawiki: kill HHVM graphite checks [puppet] - https://gerrit.wikimedia.org/r/237998 (owner: Faidon Liambotis)
[05:09:35] (PS1) Ori.livneh: Get rid of mw:monitoring:webserver [puppet] - https://gerrit.wikimedia.org/r/238091
[05:11:22] yay, the catalog compiler works again
[05:15:02] (PS2) Ori.livneh: mediawiki:monitoring:webserver: ensure => absent [puppet] - https://gerrit.wikimedia.org/r/238091
[05:15:36] (CR) Ori.livneh: [C: 2 V: 2] mediawiki:monitoring:webserver: ensure => absent [puppet] - https://gerrit.wikimedia.org/r/238091 (owner: Ori.livneh)
[05:20:59] PROBLEM - puppet last run on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:18] PROBLEM - Check size of conntrack table on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:39] PROBLEM - DPKG on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:00] PROBLEM - dhclient process on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:09] PROBLEM - Disk space on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:29] PROBLEM - RAID on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:30] PROBLEM - configured eth on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:31] PROBLEM - salt-minion processes on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:49] PROBLEM - spamassassin on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:17] PROBLEM - NTP on mendelevium is CRITICAL: NTP CRITICAL: No response from NTP server
[06:27:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[06:29:47] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:48] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:08] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:30:28] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:28] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:29] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:38] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:28] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:39] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:15] operations, Wikimedia-Language-setup, Wikimedia-Site-Requests: Bhojpuri wikipedia should start with 'bho' instead of 'bh' to avoid confusion with Bihari - https://phabricator.wikimedia.org/T41968#1636255 (Liuxinyu970226)
[06:37:09] operations, Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1636259 (Menner)
[06:39:40] operations, Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1636272 (Menner)
[06:39:41] operations, Wikimedia-SVG-rendering, Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#1636271 (Menner)
[06:42:48] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[06:46:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[06:49:17] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:56:18] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:19] PROBLEM - RAID on db1043 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[06:56:28] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:38] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:39] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:59] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:07] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:57:58] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:50] (CR) Filippo Giunchedi: [C: 1] "good chance to test T109711 too" [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: Muehlenhoff)
[07:04:17] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[07:05:57] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:06:05] (CR) Filippo Giunchedi: [C: -1] "good to merge, minor error in units" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/225292 (owner: Nemo bis)
[07:09:20] (CR) Filippo Giunchedi: [C: 1] Slightly increase RESTBase job runner concurrency [puppet] - https://gerrit.wikimedia.org/r/237868 (owner: GWicke)
[07:10:10] <_joe_> mendelevium?
[07:10:21] <_joe_> ach some catchup to do I'd say :)
[07:17:51] _joe_: yeah we're playing biology with hosts
[07:18:29] (PS4) Nemo bis: Add some more redis monitoring metrics to ganglia [puppet] - https://gerrit.wikimedia.org/r/225292
[07:20:39] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[07:21:05] (Abandoned) Giuseppe Lavagetto: labstore: fix replication checks [puppet] - https://gerrit.wikimedia.org/r/234490 (owner: Giuseppe Lavagetto)
[07:22:18] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:23:45] (CR) Giuseppe Lavagetto: [C: 2] Backport of D45165: Limit log message length for unserialize failures [debs/hhvm] - https://gerrit.wikimedia.org/r/237862 (owner: BryanDavis)
[07:23:54] (CR) Giuseppe Lavagetto: [V: 2] Backport of D45165: Limit log message length for unserialize failures [debs/hhvm] - https://gerrit.wikimedia.org/r/237862 (owner: BryanDavis)
[07:30:33] (PS5) Filippo Giunchedi: Add some more redis monitoring metrics to ganglia [puppet] - https://gerrit.wikimedia.org/r/225292 (owner: Nemo bis)
[07:30:40] (CR) Filippo Giunchedi: [C: 2 V: 2] Add some more redis monitoring metrics to ganglia [puppet] - https://gerrit.wikimedia.org/r/225292 (owner: Nemo bis)
[07:30:48] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[07:37:18] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:38:38] (PS6) KartikMistry: CX: Enable suggestion for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/237327 (https://phabricator.wikimedia.org/T112498)
[07:42:45] (PS1) KartikMistry: CX: Enable Suggestions in ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/238097 (https://phabricator.wikimedia.org/T111901)
[07:48:34] (PS2) Muehlenhoff: Create ferm rules for Hadoop master and Hadoop standby (common rules) [puppet] - https://gerrit.wikimedia.org/r/237335
[07:48:49] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[07:52:23] !log reboot ms-be1010 to pick up disk ordering change
[07:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:56:47] RECOVERY - Disk space on ms-be1010 is OK: DISK OK
[08:02:28] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:06:12] operations, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636375 (Joe) NEW a:Joe
[08:07:28] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[08:11:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[08:12:37] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:13:08] (PS1) Giuseppe Lavagetto: poolcounter: Add configuration for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/238099
[08:16:27] operations, ops-eqiad: db1043 degraded RAID - https://phabricator.wikimedia.org/T112502#1636393 (jcrespo) NEW
[08:16:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[08:17:38] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[08:17:48] ACKNOWLEDGEMENT - RAID on db1043 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T112502
[08:21:45] I am running a CPU intensive task on db1043 (but on only 1 CPU), let me know if you see some slow down on phabricator (you shouldn't)
[08:22:24] (uptime load is 1)
[08:23:14] jynus you breaking things? :) (buenos días)
[08:23:47] mafk, only when I have the time
[08:23:54] :D
[08:24:11] mobrovac: I'm about to start renaming cassandra test cluster btw T112257
[08:24:18] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:25:00] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:29:18] operations, ops-eqiad: db1043 degraded RAID - https://phabricator.wikimedia.org/T112502#1636409 (jcrespo)
[08:30:44] !log endinf profiling and executing pt-query-digest on db1043 [ETA:4h]
[08:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:31:09] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[08:34:28] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:39:37] PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[08:41:17] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:43:23] operations: staged dumps: use the "cutoff" option as little as possible - https://phabricator.wikimedia.org/T110305#1636436 (ArielGlenn) Open>Resolved this worked fine for the September run, closing.
[08:43:24] operations, Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1636438 (ArielGlenn)
[08:43:50] operations, Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1502963 (ArielGlenn)
[08:43:50] operations: worker bash script terminates early when there are still more wikis to run - https://phabricator.wikimedia.org/T107759#1636446 (ArielGlenn) Open>Resolved September run looked ok, closing.
[08:44:27] (CR) Hashar: "Will probably be in conflict with https://gerrit.wikimedia.org/r/#/c/220308/ which is currently cherry picked on the integration puppetmas" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: Zfilipin)
[08:44:30] !log silence mendelevium for today, status unclear T111532
[08:44:33] akosiaris: ^
[08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:44:51] (PS1) Jcrespo: Depool es1002, es1005, es1008 for decommission [mediawiki-config] - https://gerrit.wikimedia.org/r/238102
[08:45:08] operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1636457 (ArielGlenn)
[08:45:09] operations: need script that handles all bash worker scripts on a given snapshot, per stage, rerunning failures as appropriate, managing resources as appropriate - https://phabricator.wikimedia.org/T107760#1636455 (ArielGlenn) Open>Resolved Partial run completed fine, September full run is in last phase...
[08:45:49] (CR) Jcrespo: [C: 1] Depool es1002, es1005, es1008 for decommission [mediawiki-config] - https://gerrit.wikimedia.org/r/238102 (owner: Jcrespo)
[08:46:21] operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1502816 (ArielGlenn)
[08:46:22] operations, Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1636458 (ArielGlenn) Open>Resolved done. closing.
[08:47:09] (PS2) Filippo Giunchedi: cassandra: adjust test cluster name [puppet] - https://gerrit.wikimedia.org/r/237643 (https://phabricator.wikimedia.org/T112257)
[08:47:16] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: adjust test cluster name [puppet] - https://gerrit.wikimedia.org/r/237643 (https://phabricator.wikimedia.org/T112257) (owner: Filippo Giunchedi)
[08:52:37] !log rename cassandra test cluster and restart
[08:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:55:49] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200)
[08:58:46] looking ^
[09:00:18] (PS8) Hashar: ci: Role for running Raita [puppet] - https://gerrit.wikimedia.org/r/208024 (owner: Dduvall)
[09:01:01] (CR) Hashar: [C: 1 V: 2] "Rebased. That is applied on labs maybe we can get this change added to the next PuppetSWAT ?" [puppet] - https://gerrit.wikimedia.org/r/208024 (owner: Dduvall)
[09:01:09] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[09:02:09] (PS4) Hashar: contint: Install chromedriver for running MW-Selenium tests [puppet] - https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) (owner: Dduvall)
[09:02:54] (PS3) Hashar: contint: apt conf and python packages for light slaves [puppet] - https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972)
[09:03:14] (CR) Hashar: [C: 1] contint: apt conf and python packages for light slaves [puppet] - https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) (owner: Hashar)
[09:05:16] (CR) Hashar: "_joe_ can you please merge this one in? It has been done for the etcd python jobs so we can ship on the Jessie slave the python utilities" [puppet] - https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) (owner: Hashar)
[09:06:58] (CR) Giuseppe Lavagetto: [C: -1] ci: Role for running Raita (3 comments) [puppet] - https://gerrit.wikimedia.org/r/208024 (owner: Dduvall)
[09:07:57] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1636498 (fgiunchedi)
[09:08:00] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: rename cassandra test cluster - https://phabricator.wikimedia.org/T112257#1636496 (fgiunchedi) Open>Resolved rename has been successful, it involved following the above procedure and rolling-restart the cluster. of course since this...
[09:08:46] (CR) Giuseppe Lavagetto: [C: 2] contint: apt conf and python packages for light slaves [puppet] - https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) (owner: Hashar)
[09:08:48] (PS2) ArielGlenn: fixes for cert cleaner script for labs [puppet] - https://gerrit.wikimedia.org/r/237626
[09:08:53] !log applying schema change to flowdb
[09:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:09:25] (PS3) ArielGlenn: fixes for cert cleaner script for labs [puppet] - https://gerrit.wikimedia.org/r/237626
[09:10:03] _joe_: thank you :)
[09:10:26] (CR) ArielGlenn: [C: 2] fixes for cert cleaner script for labs [puppet] - https://gerrit.wikimedia.org/r/237626 (owner: ArielGlenn)
[09:12:07] <_joe_> uh someone merged the change already?
[09:16:01] operations: sysctl::parameters don't take effect until next reboot (on Trusty at least) - https://phabricator.wikimedia.org/T109711#1636517 (MoritzMuehlenhoff) In my tests the class works per se, the confusion might stem from the fact that the net.netfilter.nf_conntrack_buckets value cannot be changed the sam...
[09:26:13] (CR) Mobrovac: [C: 1] Slightly increase RESTBase job runner concurrency [puppet] - https://gerrit.wikimedia.org/r/237868 (owner: GWicke)
[09:35:43] (PS1) DCausse: Upgrade to extra plugin 1.7.1 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/238105 (https://phabricator.wikimedia.org/T112499)
[09:36:42] (CR) DCausse: [C: -1] "Should not be merged now" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/238105 (https://phabricator.wikimedia.org/T112499) (owner: DCausse)
[09:38:58] (PS1) Filippo Giunchedi: redis: match ganglia monitoring configuration with latest changes [puppet] - https://gerrit.wikimedia.org/r/238106
[09:42:05] (PS2) Filippo Giunchedi: redis: match ganglia monitoring configuration with latest changes [puppet] - https://gerrit.wikimedia.org/r/238106
[09:45:21] (CR) Filippo Giunchedi: [C: 2 V: 2] redis: match ganglia monitoring configuration with latest changes [puppet] - https://gerrit.wikimedia.org/r/238106 (owner: Filippo Giunchedi)
[09:51:17] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[09:55:00] (PS1) Giuseppe Lavagetto: poolcounter: add connect_timeout in codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/238108 (https://phabricator.wikimedia.org/T105378)
[09:55:02] (PS1) Giuseppe Lavagetto: poolcounter: enable connect_timeout for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/238109 (https://phabricator.wikimedia.org/T105378)
[09:55:56] operations, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636584 (Joe) Cannot make a new instance communicate with the deployment-prep puppetmaster. @andrewbogott any help would be appreciated.
[09:57:05] operations, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636585 (Joe)
[10:02:48] (PS2) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[10:04:09] !log db1029 (x1-master) temporarily saturated by connections- flow was unresponsive for 10 minutes; migration partially aborted
[10:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:16:39] operations, discovery-system: Remove etcd1001,2 from the etcd cluster, decommission them. - https://phabricator.wikimedia.org/T108010#1636600 (Joe) Open>Resolved
[10:19:18] (PS3) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[10:25:32] operations, Traffic, Monitoring, Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1636617 (Joe) I don't think it would get to be much easier, no. What we need is for pybal to write its state to disk or to expose it in some other way.
[10:25:50] (PS2) Catrope: Set $wgFlowMigrateReferenceWiki false on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/234207 (https://phabricator.wikimedia.org/T107204) (owner: Mattflaschen)
[10:25:52] (PS1) Catrope: Set $wgFlowMigrateReferenceWiki to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/238111 (https://phabricator.wikimedia.org/T107204)
[10:33:22] (CR) Jcrespo: [C: 2] Depool es1002, es1005, es1008 for decommission [mediawiki-config] - https://gerrit.wikimedia.org/r/238102 (owner: Jcrespo)
[10:35:10] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1002, es1005, es1008 (duration: 00m 12s)
[10:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:41:35] (PS4) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[10:44:04] (PS5) Muehlenhoff: Raise default conntrack table size (WIP) [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[10:48:56] (CR) Jean-Frédéric: [C: 1] Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: Dereckson)
[10:52:58] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:57:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 22 data above and 8 below the confidence bounds
[11:01:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 21 data above and 8 below the confidence bounds
[11:03:02] (CR) Steinsplitter: "open since two weeks. can we please go ahead and merge it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: Dereckson)
[11:17:14] (CR) Jcrespo: Add scap scripts to all canary app servers [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo)
[11:17:35] operations, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636776 (Krenair) ```root@deployment-poolcounter01:/var/lib/puppet# ping deployment-puppetmaster PING deployment-puppetmaster.deployment-prep.eqiad.wm...
[11:19:23] operations, Beta-Cluster, Labs, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636779 (Krenair)
[11:19:46] operations, Beta-Cluster, Labs, MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636375 (Krenair)
[11:19:47] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:21:48] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1636783 (Pcoombe) @jrobell AFAIK this change won't affect banners. The main sites already use IPv6, this change is only for donatewiki. Related: it s...
[11:23:47] (CR) Matthias Mullie: [C: 1] "AFAICT, references are only fetched:" [mediawiki-config] - https://gerrit.wikimedia.org/r/238111 (https://phabricator.wikimedia.org/T107204) (owner: Catrope)
[11:26:53] operations, Labs, Salt: salt does not run reliably for toollabs / labs generally - https://phabricator.wikimedia.org/T99213#1636790 (ArielGlenn) These changes are now live on labstore1001. Check of instances that don't reply to test.ping now. **The following have 'no route to host' so presumably they...
[11:30:09] operations, Labs, Salt: salt does not run reliably for toollabs / labs generally - https://phabricator.wikimedia.org/T99213#1636794 (ArielGlenn) response fast, all ten hosts, no timeout needed: ``` root@labcontrol1001:~# salt -G 'fqdn:tools-webgrid-lighttpd-12*' cmd.run hostname tools-webgrid-lightt...
[11:30:56] operations, Labs, Salt: salt does not run reliably for toollabs / labs generally - https://phabricator.wikimedia.org/T99213#1636802 (ArielGlenn) Open>Resolved a:ArielGlenn closing this, opening another ticket specific to the instances that owners must fix.
[11:33:41] !log citoid deploying d569951
[11:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:36:38] (PS6) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[11:48:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 9 below the confidence bounds
[11:51:19] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 8 below the confidence bounds
[11:54:47] operations, HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1636888 (Krenair) NEW
[11:56:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[11:57:47] operations, HHVM, Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1636896 (JeroenDeDauw) Is there any ETA on being able to use PHP 5.4 features in WMF deployed extensions yet?
[12:00:42] operations, Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1636908 (ArielGlenn) I'll see if I can correlate the times to server activity to get a lead on this.
[12:01:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[12:03:20] operations, Beta-Cluster, HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1636917 (Krenair) Trusty replacement for tin = mira?
[12:06:40] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[12:06:57] operations, Traffic, Monitoring, Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1636919 (fgiunchedi) an easier option would be for pybal to expose its internal state via http for clients (e.g. icinga checks) to fetch, like e.g. hhvm d...
[12:07:28] operations, HHVM, Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1636920 (Krenair) * {T104747} is blocked on {T110707} which is blocked by ops on https://gerrit.wikimedia.org/r/#/c/234699/ * {T94277} is waiting on @ArielGle...
[12:07:56] operations, Traffic, Monitoring, Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1636929 (Joe) @fgiunchedi I am working on a patch in that direction right now :)
[12:08:12] operations, Traffic, Monitoring, Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1636930 (Joe) a:Joe
[12:11:02] (PS7) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[12:13:30] (CR) Muehlenhoff: "I've tested this on the mediawiki instances in deployment-prep and in my ferm test systems in labs; the correct values are set after a reb" [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: Muehlenhoff)
[12:15:54] moritzm, I was wondering if you might know what's going on in https://phabricator.wikimedia.org/T112501#1636776 ?
[12:17:24] operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1636967 (BBlack) We haven't had the time to devote to it yet, it's just a matter of scheduling and priorities.
[12:23:48] (CR) BBlack: [C: 1] Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: Muehlenhoff)
[12:24:16] Krenair: hmm, it's not caused by ferm rules on either deployment-puppetmaster nor deployment-poolcounter01 (they don't have any), maybe related to the openstack update? I'll have a look at the logs
[12:24:29] thanks
[12:24:42] I checked and other hosts were successfully connecting to that port
[12:26:09] <_joe_> moritzm, Krenair I'm on it
[12:26:27] <_joe_> it's clearly a higher-level problem in the cloud network
[12:27:24] <_joe_> my guess is andrewbogott might know something more about that
[12:27:42] (PS2) Filippo Giunchedi: cassandra: enable DC internode encryption for test cluster [puppet] - https://gerrit.wikimedia.org/r/237648 (https://phabricator.wikimedia.org/T108953)
[12:27:50] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: enable DC internode encryption for test cluster [puppet] - https://gerrit.wikimedia.org/r/237648 (https://phabricator.wikimedia.org/T108953) (owner: Filippo Giunchedi)
[12:28:09] (PS2) ArielGlenn: crap salt cleanup scripts primarily for labs use [software] - https://gerrit.wikimedia.org/r/236798
[12:30:20] (PS1) Aude: Remove (broken) Wikidata-specific SkinCopyrightFooter hook handler [mediawiki-config] - https://gerrit.wikimedia.org/r/238125 (https://phabricator.wikimedia.org/T112520)
[12:32:09] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[12:32:33] !log enable dc encryption on cassandra test cluster and rolling restart
[12:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:33:40] (PS2) Hashar: contint: upgrade setuptools from pypi [puppet] - https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506)
[12:34:13] (CR) Hashar: [C: 1 V: 2] "Had this patch on the integration puppetmaster. On creating a new node every went fine and the jobs are properly running." [puppet] - https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506) (owner: Hashar)
[12:35:30] operations, Traffic, HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#1637061 (BBlack) NEW
[12:39:17] operations, Salt: fix monitor-salt-keys.py to not rotate salt aes keys on deletion - https://phabricator.wikimedia.org/T112522#1637080 (ArielGlenn) NEW a:ArielGlenn
[12:47:09] operations, Citoid, Services, Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1637106 (BBlack) I don't see where that's noted there. Are you saying there's a reason to keep a separate cxserver.wikimedia.org in the long term, even after making it available via RB?
[12:50:38] !log starting Cassandra repair on restbase1003 (nodetool repair -pr)
[12:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:52:40] (PS1) Filippo Giunchedi: cassandra: add auxiliary (non-seed) codfw test hosts [puppet] - https://gerrit.wikimedia.org/r/238135 (https://phabricator.wikimedia.org/T108613)
[12:54:45] (CR) Alex Monk: [C: 2] Remove (broken) Wikidata-specific SkinCopyrightFooter hook handler [mediawiki-config] - https://gerrit.wikimedia.org/r/238125 (https://phabricator.wikimedia.org/T112520) (owner: Aude)
[12:54:52] (Merged) jenkins-bot: Remove (broken) Wikidata-specific SkinCopyrightFooter hook handler [mediawiki-config] - https://gerrit.wikimedia.org/r/238125 (https://phabricator.wikimedia.org/T112520) (owner: Aude)
[12:55:40] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/238125/ (duration: 00m 13s)
[12:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:56:14] !log rebooting lvs2006 to test eth hw params stuff...
[12:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:39] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM overall, not sure about the onlyif test" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: 10Muehlenhoff) [13:02:47] thanks Krenair :) [13:07:04] 6operations, 10Citoid, 6Services, 10Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1637147 (10Mvolz) It was mentioned on another task (and I can't currently find the thread!) but @Jdforrester-WMF mentioned that we've been actively encouraging developers to use the cito... [13:08:54] (03CR) 10Filippo Giunchedi: [C: 04-1] "I initially thought we'd want to separate seeds from non-seeds based on DC separation, though clients will cycle through seeds anyways so " [puppet] - 10https://gerrit.wikimedia.org/r/238135 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [13:09:18] 6operations, 10Citoid, 6Services, 10Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1637153 (10mobrovac) >>! In T110476#1637106, @BBlack wrote: > I don't see where that's noted there. Hm, indeed, it's not. Hm, strange. I remember discussing it with @Jdforrester-WMF on... [13:13:07] 6operations, 10Salt: various salt-minions are not replying to test.ping or commands - https://phabricator.wikimedia.org/T102808#1637169 (10ArielGlenn) I will be looking into this again this week. 
[13:15:11] (03PS1) 10Filippo Giunchedi: cassandra: add codfw test nodes [puppet] - 10https://gerrit.wikimedia.org/r/238138 (https://phabricator.wikimedia.org/T108613) [13:16:05] (03Abandoned) 10Filippo Giunchedi: cassandra: add auxiliary (non-seed) codfw test hosts [puppet] - 10https://gerrit.wikimedia.org/r/238135 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [13:22:40] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/238138 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [13:23:13] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add codfw test nodes [puppet] - 10https://gerrit.wikimedia.org/r/238138 (https://phabricator.wikimedia.org/T108613) (owner: 10Filippo Giunchedi) [13:26:48] 6operations, 10Traffic, 5Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1637197 (10BBlack) GRO and LRO seem fine. Still facing an issue with both the rxring parameters and the interface-rps parameters. They can both be applied successfully post-b... [13:31:09] Hi [13:31:31] Is there anything that can be said about yesterday's "incident" yet? [13:33:19] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1637211 (10Jgreen) a:5Jgreen>3BBlack [13:34:31] 6operations, 10fundraising-tech-ops: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1637217 (10Jgreen) p:5Normal>3High [13:37:54] 6operations, 10Traffic, 10fundraising-tech-ops, 7IPv6, 5Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1637231 (10jrobell) Thank you @Pcoombe, as this doesn't seem to affect banner campaigns, the planned Luxembourg and Belgium campaign just went up at 1.3... 
[13:38:29] !log stop puppet on restbase-test2001 and turn up cassandra [13:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:39] 6operations, 10fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1637236 (10Jgreen) [13:39:22] 6operations, 10fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1581705 (10Jgreen) [13:39:23] 6operations, 10fundraising-tech-ops: build libanon package for trusty - https://phabricator.wikimedia.org/T110739#1637238 (10Jgreen) 5Open>3Resolved builds now [13:41:18] RECOVERY - Cassandra database on restbase-test2001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [13:42:18] RECOVERY - Cassanda CQL query interface on restbase-test2001 is OK: TCP OK - 0.036 second response time on port 9042 [13:47:01] 6operations, 10fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1637249 (10Jgreen) [13:48:27] 6operations, 10Traffic, 5Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1637252 (10BBlack) Digging a little further in syslogs, apparently it is a race. systemd ends up trying to configure eth[12] first and they fail the RSS IRQ pattern check, and... 
[13:52:07] 6operations, 10fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1637263 (10Jgreen) 5Open>3Resolved compiles/builds fine now [14:00:44] (03PS1) 10Filippo Giunchedi: cassandra: enable ssl_storage_port (7001) in ferm [puppet] - 10https://gerrit.wikimedia.org/r/238144 (https://phabricator.wikimedia.org/T108953) [14:01:45] (03PS2) 10Filippo Giunchedi: cassandra: enable ssl_storage_port (7001) in ferm [puppet] - 10https://gerrit.wikimedia.org/r/238144 (https://phabricator.wikimedia.org/T108953) [14:02:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: enable ssl_storage_port (7001) in ferm [puppet] - 10https://gerrit.wikimedia.org/r/238144 (https://phabricator.wikimedia.org/T108953) (owner: 10Filippo Giunchedi) [14:02:52] (03PS3) 10Ottomata: Create ferm rules for Hadoop NameNode and ResourceManager for master and standby [puppet] - 10https://gerrit.wikimedia.org/r/237335 (owner: 10Muehlenhoff) [14:14:48] RECOVERY - Cassandra database on restbase-test2003 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [14:15:18] RECOVERY - Cassanda CQL query interface on restbase-test2003 is OK: TCP OK - 0.034 second response time on port 9042 [14:15:57] 6operations: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1637334 (10csteipp) If we ever do this again in the future, let's use a wikimedia.org domain ( longer history of segmenting untrusted subdomains). @Bblack, do you have an eta from GlobalS... 
[14:16:08] RECOVERY - Cassandra database on restbase-test2002 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [14:16:19] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [500.0] [14:17:37] RECOVERY - Cassanda CQL query interface on restbase-test2002 is OK: TCP OK - 0.034 second response time on port 9042 [14:17:48] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [14:20:10] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1637354 (10fgiunchedi) ok cassandra is up in codfw with encryption enabled and `auto_bootstrap: false` so codfw and eqiad are seeing each other (step #4)... [14:26:48] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [5000000.0] [14:27:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:30:58] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:30:58] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:31:28] <_joe_> godog: ^^ [14:32:08] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:32:20] looking, thanks _joe_ [14:37:08] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [14:38:47] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [14:39:19] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [14:39:19] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [14:41:10] there's some 5xx alerts showing up for restbase in the dashboard, looking at those as well [14:41:17] 6operations, 10ops-codfw, 10netops: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1637414 (10Papaul) RMA create.. Hello Papaul The RMA is already done, the order # is R395890, The local logistics department will receive the request and will proceed from here, I will like you to keep... [14:43:57] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revisio [14:44:37] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:44:38] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:46:09] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [14:46:21] (03PS1) 10ArielGlenn: labs key monitor/delete script: don't rotate saltmaster aes key on key deletion [puppet] - 10https://gerrit.wikimedia.org/r/238151 [14:48:21] (03PS1) 10Giuseppe Lavagetto: Add instrumentation [debs/pybal] - 10https://gerrit.wikimedia.org/r/238152 (https://phabricator.wikimedia.org/T102394) [14:48:57] 6operations, 10ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1637427 (10Papaul) I will have replacement drive on site tomorrow. [14:49:19] 6operations, 10ops-eqiad, 10Traffic, 10netops: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1637428 (10faidon) [14:51:27] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision [14:52:26] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: rsync the diff since mail was held on sodium - https://phabricator.wikimedia.org/T110138#1637435 (10Dzahn) one more time, about 20 hours later sent 5435891 bytes received 10238 bytes 37952.12 bytes/sec total size is 2837146704 speedup is 520.95... 
[14:52:54] (03CR) 10ArielGlenn: [C: 032] labs key monitor/delete script: don't rotate saltmaster aes key on key deletion [puppet] - 10https://gerrit.wikimedia.org/r/238151 (owner: 10ArielGlenn) [14:53:37] 6operations, 7Availability, 7Monitoring: Monitor MediaWiki sessions - https://phabricator.wikimedia.org/T108985#1637437 (10chasemp) [14:53:37] (03PS1) 10Andrew Bogott: Upgrade labvirt1007 to kilo [puppet] - 10https://gerrit.wikimedia.org/r/238154 [14:54:27] 6operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch, 7Graphite: Deploy statsd plugin for production elasticsearch & logstash - https://phabricator.wikimedia.org/T90889#1637438 (10chasemp) We are going to try on https://phabricator.wikimedia.org/T111573 first I think [14:55:09] 6operations, 10Beta-Cluster, 6Labs, 10MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637443 (10Andrew) I see this problem and can reproduce it on another instance. No idea as to the cause yet. [14:55:18] (03PS2) 10Andrew Bogott: Upgrade labvirt1007 to kilo [puppet] - 10https://gerrit.wikimedia.org/r/238154 [14:56:22] (03CR) 10Andrew Bogott: [C: 032] Upgrade labvirt1007 to kilo [puppet] - 10https://gerrit.wikimedia.org/r/238154 (owner: 10Andrew Bogott) [14:57:11] 6operations, 10Traffic, 5Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1637444 (10BBlack) So, basically this is a race centered around bnx2x->udev->systemd event notifications and /e/n/i up-commands that set hardware parameters. It's probably goi... 
[14:57:33] 6operations, 10Traffic, 5Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1637449 (10BBlack) [14:57:49] 6operations, 10Traffic, 5Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1580041 (10BBlack) [14:57:50] 6operations, 10ops-eqiad, 10Traffic, 10netops: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1637451 (10BBlack) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T1500). Please do the needful. [15:00:04] Krenair bawolff: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] hey [15:00:18] hi [15:00:36] (03CR) 10Alex Monk: [C: 032] Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: 10Dereckson) [15:01:04] (03Merged) 10jenkins-bot: Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: 10Dereckson) [15:01:23] Woo! 
[15:01:59] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/234980/ (duration: 00m 12s) [15:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:38] PROBLEM - DPKG on lvs1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:02:48] PROBLEM - DPKG on lvs3003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:02:58] PROBLEM - DPKG on lvs3004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:02:58] PROBLEM - DPKG on lvs3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:02:58] PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:03:17] PROBLEM - DPKG on lvs1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:03:18] (03CR) 10Alex Monk: [C: 032] Revert "Add interwiki-labs.cdb" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237529 (owner: 10Alex Monk) [15:03:39] (03Merged) 10jenkins-bot: Revert "Add interwiki-labs.cdb" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237529 (owner: 10Alex Monk) [15:03:50] 6operations, 10Salt: check usage of salt-key delete everywhere - https://phabricator.wikimedia.org/T112534#1637484 (10ArielGlenn) 3NEW a:3ArielGlenn [15:03:59] PROBLEM - DPKG on lvs1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:03:59] PROBLEM - DPKG on lvs1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:04:08] PROBLEM - DPKG on lvs3001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:04:17] PROBLEM - DPKG on lvs1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:04:26] ^ that's me [15:04:32] 6operations, 10Salt: fix monitor-salt-keys.py to not rotate salt aes keys on deletion - https://phabricator.wikimedia.org/T112522#1637495 (10ArielGlenn) 5Open>3Resolved https://gerrit.wikimedia.org/r/#/c/238151/ tested, merged and deployed. 
see related task T112534 [15:04:48] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/237529/ (duration: 00m 11s) [15:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:23] !log krenair@tin Synchronized docroot/noc: https://gerrit.wikimedia.org/r/#/c/237529/ (duration: 00m 12s) [15:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:46] 6operations, 10Beta-Cluster, 6Labs, 10MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637501 (10Andrew) This appears to be yet another issue with the nova rolling-upgrade process. The new instance, deployment-puppetmaster, was run... [15:05:47] !log krenair@tin Synchronized .gitignore: https://gerrit.wikimedia.org/r/#/c/237529/ (duration: 00m 13s) [15:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:17] 6operations, 10Beta-Cluster, 6Labs, 10MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637502 (10Andrew) [15:07:27] RECOVERY - DPKG on lvs1006 is OK: All packages OK [15:07:57] RECOVERY - DPKG on lvs3003 is OK: All packages OK [15:08:23] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: rsync the diff since mail was held on sodium - https://phabricator.wikimedia.org/T110138#1637520 (10JohnLewis) so we're seeing static values below 2 hours but not less than an hour and forty minutes? Still seems high but if we are directly rsyncing,... 
[15:09:48] RECOVERY - DPKG on lvs3004 is OK: All packages OK [15:09:48] RECOVERY - DPKG on lvs3002 is OK: All packages OK [15:09:55] 6operations, 10Beta-Cluster, 6Labs, 10MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637526 (10Krenair) Signed and puppet successfully ran on deployment-poolcounter01.deployment-prep.eqiad.wmflabs [15:11:39] RECOVERY - DPKG on lvs1002 is OK: All packages OK [15:12:00] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [500.0] [15:12:38] RECOVERY - DPKG on lvs3001 is OK: All packages OK [15:13:22] 6operations, 10Traffic, 5Patch-For-Review: Fix ethernet startup race on HP LVS w/ jessie - https://phabricator.wikimedia.org/T110530#1637531 (10BBlack) [15:13:46] 6operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1637536 (10GWicke) In the meantime, I pushed 0.4 to releases.wikimedia.org, but before we can switch to that {T111225} will need to be resolved. [15:14:16] 6operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1637537 (10GWicke) p:5Low>3High [15:14:49] RECOVERY - DPKG on lvs1001 is OK: All packages OK [15:16:07] RECOVERY - DPKG on lvs1003 is OK: All packages OK [15:16:17] RECOVERY - DPKG on lvs1005 is OK: All packages OK [15:17:37] RECOVERY - DPKG on lvs1004 is OK: All packages OK [15:17:45] 6operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1637548 (10GWicke) Upped the priority to 'high', as this is blocking the move to the official repository. Since the labs repository is still broken pending some file restoration, we current... 
[15:18:24] 6operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1637550 (10GWicke) [15:18:52] (03PS1) 10Andrew Bogott: Move default openstack version to Kilo [puppet] - 10https://gerrit.wikimedia.org/r/238158 [15:20:09] Krenair: Can I sneak another change into the SWAT window? I can deploy it myself [15:20:18] (Cherry-pick of https://gerrit.wikimedia.org/r/238115 once it merges) [15:20:22] At this point it's not exactly "sneaking", but sure : [15:20:23] :) [15:20:29] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:23:08] PROBLEM - DPKG on lvs3003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:23:14] (03CR) 10Nuria: Set replace=True for EventLogging MySQL consumer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237688 (https://phabricator.wikimedia.org/T112265) (owner: 10Ottomata) [15:23:18] PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:23:18] PROBLEM - DPKG on lvs3004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:24:18] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:18] PROBLEM - DPKG on lvs1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:24:28] PROBLEM - DPKG on lvs3001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:24:35] Yeah I guess it's a bit late [15:24:58] RECOVERY - DPKG on lvs1001 is OK: All packages OK [15:26:07] RECOVERY - DPKG on lvs1004 is OK: All packages OK [15:26:38] RECOVERY - DPKG on lvs3003 is OK: All packages OK [15:26:48] RECOVERY - DPKG on lvs3004 is OK: All packages OK [15:26:54] (03PS2) 10Lokal Profil: Localisation updates from translatewiki.net [puppet] - 10https://gerrit.wikimedia.org/r/229136 [15:27:55] (03CR) 10Andrew Bogott: [C: 032] Move default openstack version to Kilo [puppet] - 10https://gerrit.wikimedia.org/r/238158 (owner: 10Andrew Bogott) [15:29:40] (03PS1) 
10BBlack: switch lvs[34]00x installer to jessie [puppet] - 10https://gerrit.wikimedia.org/r/238161 (https://phabricator.wikimedia.org/T96375) [15:30:38] RECOVERY - Disk space on labstore1002 is OK: DISK OK [15:32:04] (03CR) 10Lokal Profil: "The latest patch set is only an update to include translations done since the first patch set was sent for review." [puppet] - 10https://gerrit.wikimedia.org/r/229136 (owner: 10Lokal Profil) [15:33:08] (03CR) 10BBlack: [C: 032] switch lvs[34]00x installer to jessie [puppet] - 10https://gerrit.wikimedia.org/r/238161 (https://phabricator.wikimedia.org/T96375) (owner: 10BBlack) [15:34:45] !log catrope@tin Synchronized php-1.26wmf22/extensions/Echo/: SWAT (duration: 00m 13s) [15:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:23] 6operations, 10Beta-Cluster, 6Labs, 10MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637622 (10Andrew) [15:37:33] 6operations, 10MediaWiki-General-or-Unknown, 5Patch-For-Review: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1637625 (10Andrew) [15:37:34] 6operations, 10Beta-Cluster, 6Labs, 10MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637624 (10Andrew) 5Open>3Resolved [15:37:39] (03PS1) 10ArielGlenn: for wmf reimage script, don't rotate saltmaster aes key on minion key deletion [puppet] - 10https://gerrit.wikimedia.org/r/238164 [15:38:04] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1637628 (10jcrespo) [15:39:03] !log reinstalling lvs4003, lvs4003 (jessie upgrade: T96375) [15:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:15] I think we are storing every git commit on 
phabricator [15:39:17] !log reinstalling lvs4003, lvs4004 (jessie upgrade: T96375) (typo earlier) [15:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:32] (03PS2) 10Tim Landscheidt: Tools: Migrate from labsdebrepo to aptly [puppet] - 10https://gerrit.wikimedia.org/r/238089 [15:40:41] (03CR) 10Tim Landscheidt: "(PS2: Use apt::pin instead of a file resource.)" [puppet] - 10https://gerrit.wikimedia.org/r/238089 (owner: 10Tim Landscheidt) [15:42:31] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1637656 (10jcrespo) I've reopened T110913 to include the profiling I did on phabricator during the weekend. Some scary things there (in terms of performance). [15:43:36] 6operations: Unable to connect to deployment-eventlogging02.eqiad.wmflabs - https://phabricator.wikimedia.org/T112540#1637657 (10Mholloway) 3NEW [15:43:57] 6operations: Unable to connect to deployment-eventlogging02.eqiad.wmflabs - https://phabricator.wikimedia.org/T112540#1637669 (10Mholloway) [15:44:08] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [15:44:57] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [15:44:57] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [15:46:23] !log switch to openjdk-8 and bounce cassandra on restbase-test200* [15:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:45] 6operations, 10Salt: check usage of salt-key delete everywhere - https://phabricator.wikimedia.org/T112534#1637676 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/238164/ key rotation not needed for host re-imaging, the 24 hour rotation is good enough [15:50:52] 6operations, 10Salt: check usage of salt-key delete everywhere - https://phabricator.wikimedia.org/T112534#1637690 (10ArielGlenn) looking at 
https://gerrit.wikimedia.org/r/#/c/48983/ it seems that auth.sls would be deleting keys almost never. so we can leave that script alone. [15:51:28] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:52:15] (03PS8) 10Muehlenhoff: Raise default conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) [15:53:07] 10Ops-Access-Requests, 6operations: Unable to connect to deployment-eventlogging02.eqiad.wmflabs - https://phabricator.wikimedia.org/T112540#1637712 (10Mholloway) [15:56:03] product duty never changes, does it [15:56:18] PROBLEM - nova-scheduler process on labcontrol1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler [15:56:26] (03PS1) 10Andrew Bogott: Openstack: Return the nova scheduler pool to normal. [puppet] - 10https://gerrit.wikimedia.org/r/238169 [15:59:37] mutante: perhaps it did a few years ago ;) [16:00:15] (03CR) 10Andrew Bogott: [C: 032] Openstack: Return the nova scheduler pool to normal. [puppet] - 10https://gerrit.wikimedia.org/r/238169 (owner: 10Andrew Bogott) [16:02:22] SPF|Cloud: impossible! its only existed about a year (if not less) [16:03:59] 7Puppet, 6Analytics-Backlog, 10Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} - https://phabricator.wikimedia.org/T101763#1637743 (10madhuvishy) [16:05:58] 6operations: audit all ssh certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1637752 (10RobH) 3NEW a:3RobH [16:06:15] !log stopping hdfs journalnode on analytics1011 to copy journal edits to new journalnodes on analytics1035 and analytics1052 [16:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:57] robh: did you mean SSL? 
:) [16:07:12] 6operations: audit all ssl certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1637763 (10RobH) [16:07:13] yes [16:07:17] 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1637765 (10Krenair) [16:08:51] 6operations, 6Performance-Team: New URL scheme for service-generated thumbnails - https://phabricator.wikimedia.org/T111048#1637782 (10Gilles) 5Open>3Invalid After re-reading the IIIF spec, it seems way too large a standard to support. It requires supporting many formats and filters that thumbor doesn't. I... [16:08:51] well, we can change [16:09:34] (03PS1) 10Ottomata: Adding new journalnodes in prep for decomissioning analytics1011 and analytics1019 [puppet] - 10https://gerrit.wikimedia.org/r/238173 (https://phabricator.wikimedia.org/T112113) [16:10:12] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1637794 (10mmodell) Doesn't look too scary to me, can you elaborate? [16:10:47] (03CR) 10Ottomata: [C: 032] Adding new journalnodes in prep for decomissioning analytics1011 and analytics1019 [puppet] - 10https://gerrit.wikimedia.org/r/238173 (https://phabricator.wikimedia.org/T112113) (owner: 10Ottomata) [16:12:32] 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1637820 (10RobH) this is changing scope to a checklist for all ssl certificate purchases and how to review and audit [16:12:42] !log catrope@tin Synchronized php-1.26wmf22/extensions/Echo/: For real this time (duration: 00m 11s) [16:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:13] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:13:13] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:14:23] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:15:05] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [16:15:05] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [16:16:14] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [16:22:14] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:22:53] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:22:54] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:08] (03PS1) 10Ottomata: Update net-topology.py for Hadoop so that the default rack has the same hierarchy as real nodes [puppet] - 10https://gerrit.wikimedia.org/r/238179 [16:23:11] (03CR) 10Mdann52: [C: 031] noindex user namespace on en.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) (owner: 10Mdann52) [16:23:33] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:23:34] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:23:44] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:23:45] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:24:45] (03CR) 10Ottomata: [C: 032] Update net-topology.py for Hadoop so that the default rack has the same hierarchy as real nodes [puppet] - 10https://gerrit.wikimedia.org/r/238179 (owner: 10Ottomata) [16:24:53] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:27:08] Nemo_bis: why did you remove it? 
[16:27:44] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1637867 (10Eevans) [16:27:46] it's there for a reason :) and unless someone from product said it doesn't need to be there; it should be. [16:28:13] PROBLEM - Hadoop NameNode Primary Is Active on analytics1001 is CRITICAL: Hadoop.NameNode.FSNamesystem.tag_HAState CRITICAL: standby [16:28:14] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:28:46] ^^^ this is ok [16:28:55] i'm moving restarting namenodes to get a change in [16:29:07] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1524920 (10Eevans) [16:29:35] ACKNOWLEDGEMENT - Hadoop NameNode Primary Is Active on analytics1001 is CRITICAL: Hadoop.NameNode.FSNamesystem.tag_HAState CRITICAL: standby ottomata restarting namenodes [16:31:03] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:14] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:32:24] 6operations, 5Patch-For-Review, 7Pybal: jessie pybals get restarted every day by logrotate, resetting BGP sessions - https://phabricator.wikimedia.org/T112457#1637893 (10BBlack) [16:32:25] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:32:25] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1637892 (10BBlack) [16:32:44] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1215460 (10BBlack) [16:33:24] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [16:33:28] 10Ops-Access-Requests, 6operations: Unable to connect to deployment-eventlogging02.eqiad.wmflabs - https://phabricator.wikimedia.org/T112540#1637899 (10Krenair) wmflabs -> removing #operations and #ops-access-requests [16:33:43] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [16:33:43] RECOVERY - Hadoop NameNode Primary Is Active on analytics1001 is OK: Hadoop.NameNode.FSNamesystem.tag_HAState OKAY: active [16:33:44] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up [16:37:40] JohnFLewis: I think you missed some steps, but do as you prefer [16:38:27] Nemo_bis: in such a meta case; it's better to leave it and discuss than reverse. you never know - during the time its not there someone may actually need a product duty guy :) [16:39:54] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:39:55] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:40:42] (03PS1) 10Ottomata: Add analytics1053 and 1057 to Hadoop net-topology.py [puppet] - 10https://gerrit.wikimedia.org/r/238181 [16:41:38] (03PS2) 10Ottomata: Add analytics1053 and 1057 to Hadoop net-topology.py [puppet] - 10https://gerrit.wikimedia.org/r/238181 [16:41:44] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:43:43] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [16:43:58] (03CR) 10Ottomata: [C: 032] Add analytics1053 and 1057 to Hadoop net-topology.py [puppet] - 10https://gerrit.wikimedia.org/r/238181 (owner: 10Ottomata) [16:44:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1637973 (10RobH) This was approved in the ops meeting just now, so implementation to follow. (I'm merely noting the approval in the meeting on task.) [16:47:33] RECOVERY - Disk space on mw1006 is OK: DISK OK [16:47:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] nodepool: sudo rules for contint-admins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) (owner: 10Hashar) [16:48:29] jelouuu, does anyone know what version of logstash do we have deployed? [16:49:54] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:13] PROBLEM - Hadoop NodeManager on analytics1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:50:23] nuria: _808db would know [16:50:31] oh I Just realized that's the reverse of his regular name [16:51:09] _808db: hello [16:52:54] that's me too, aye yai yai yarn man [16:53:24] me no comprendou [16:53:34] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:54:29] (03PS1) 10Ottomata: Remove analytics1011, 1016, and 1019 as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/238185 (https://phabricator.wikimedia.org/T112113) [16:56:33] (03CR) 10Ottomata: [C: 032] Remove analytics1011, 1016, and 1019 as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/238185 (https://phabricator.wikimedia.org/T112113) (owner: 10Ottomata) [16:56:36] nuria, I think _808db is away [16:57:04] phabricator says for two weeks [17:01:55] <_joe_> ottomata: btw, I'm back and able for interviews :) [17:04:22] _joe_: ok awesome [17:04:28] did you summit all the mountains? [17:04:49] <_joe_> ottomata: ahah yeah [17:05:55] <_joe_> ottomata: I was here http://www.kastra.eu/pics/mystra18.jpg when I got your message :) [17:06:33] OooO [17:08:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [17:12:05] 6operations, 10ops-eqiad, 6Labs, 3Labs-Sprint-114, 3ToolLabs-Goals-Q4: Make certain ports and cables between the labstores and shelves are numbered/named and labeled, and make sure that the diagram(s) reflect that. 
- https://phabricator.wikimedia.org/T112549#1638094 (10coren) 3NEW a:3coren [17:13:23] (03PS1) 10Tim Landscheidt: shinken: Make shinkengen compatible with ldap3 0.9.4.2 [puppet] - 10https://gerrit.wikimedia.org/r/238190 (https://phabricator.wikimedia.org/T101824) [17:15:25] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:17:08] (03PS10) 10Ori.livneh: Send image varnish frontend data from logs to statsd [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles) [17:18:21] (03CR) 10Ori.livneh: [C: 032] Send image varnish frontend data from logs to statsd [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles) [17:19:41] argh, upload varnishes will complain about puppet failures in a moment [17:19:41] sorry [17:19:50] 6operations, 10fundraising-tech-ops: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1638171 (10Jgreen) >>! In T110591#1610291, @Ottomata wrote: > Done https://gerrit.wikimedia.org/r/#/c/236066/ > > http://apt.wikimedia.org/wikimedia/pool/main/k/kafkatee/ Minor bu... [17:21:27] (03CR) 10Tim Landscheidt: "Tested this on shinken-test8-scfc against the backport and on shinken-01 against the current live one." [puppet] - 10https://gerrit.wikimedia.org/r/238190 (https://phabricator.wikimedia.org/T101824) (owner: 10Tim Landscheidt) [17:21:30] oh, no, they won't [17:21:36] the puppet failure i saw was some race condition [17:27:17] (03PS1) 10Chad: Minor tweaks to my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/238191 [17:28:17] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638205 (10jcrespo) Maybe inserting all the commits on the 16GB blob table is one of the reason of slowdowns (pure speculation). 
[17:28:33] jouncebot: next [17:28:33] In 2 hour(s) and 31 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T2000) [17:30:47] (03PS1) 10BBlack: update star.wmfusercontent.org cert [puppet] - 10https://gerrit.wikimedia.org/r/238192 [17:31:00] (03CR) 10BBlack: [C: 032 V: 032] update star.wmfusercontent.org cert [puppet] - 10https://gerrit.wikimedia.org/r/238192 (owner: 10BBlack) [17:35:06] heyo - does anyone know why etherpad seems to be down? [17:35:10] etherpad just fell over? [17:35:19] ok, not just me :) [17:35:24] seems like it [17:37:31] (03PS2) 10Chad: Phab (labs): Move sshd to 2222, easier to remember than 222 [puppet] - 10https://gerrit.wikimedia.org/r/235777 [17:37:49] at least it's not 29418 ostriches [17:38:00] (03PS1) 10BBlack: switch wmfusercontent.org to RSA-only temporarily [puppet] - 10https://gerrit.wikimedia.org/r/238195 [17:38:10] Krenair: That's for system sshd, most people won't use. [17:38:11] (03CR) 10BBlack: [C: 032 V: 032] switch wmfusercontent.org to RSA-only temporarily [puppet] - 10https://gerrit.wikimedia.org/r/238195 (owner: 10BBlack) [17:38:16] ah [17:38:18] Git's SSH will be more locked down, and on :22 [17:38:26] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [17:38:35] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [17:38:45] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:38:45] RECOVERY - DPKG on mw1006 is OK: All packages OK [17:40:13] atgo, ebernhardson: it looks up [17:40:16] but very slow [17:40:37] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:43] unusably slow [17:41:46] definitely totally down for me, krenair [17:41:58] and sometimes it looks completely down [17:42:02] I did manage to load a pad just now 
though [17:43:46] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:57] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:43:57] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:44:02] (03CR) 10Chad: "Looks mostly good, minor nit inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [17:44:37] Yeah, it's just broken [17:44:44] I think something similar to this happened recently [17:44:52] I forget who was able to fix it [17:45:07] (03PS1) 10BBlack: Revert "switch phab altdom to phab.wikidata.org T112381" [puppet] - 10https://gerrit.wikimedia.org/r/238196 [17:45:09] (03PS1) 10BBlack: Revert "Temporarily move phab altdom into wikivoyage.org" [puppet] - 10https://gerrit.wikimedia.org/r/238197 [17:45:17] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:45:21] (03CR) 10BBlack: [C: 032 V: 032] Revert "switch phab altdom to phab.wikidata.org T112381" [puppet] - 10https://gerrit.wikimedia.org/r/238196 (owner: 10BBlack) [17:45:32] (03CR) 10BBlack: [C: 032 V: 032] Revert "Temporarily move phab altdom into wikivoyage.org" [puppet] - 10https://gerrit.wikimedia.org/r/238197 (owner: 10BBlack) [17:45:45] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.011 second response time [17:46:03] only incident docs I found for it were https://wikitech.wikimedia.org/wiki/Incident_documentation/20140714-Etherpad so I guess nobody bothered to write any last time [17:48:25] (03PS1) 10BBlack: Revert "Temporarily create phab.wikivoyage.org" [dns] - 10https://gerrit.wikimedia.org/r/238198 [17:48:32] ostriches: woohoo, yay for using 22 :D [17:49:38] (03PS2) 10BBlack: Revert "Temporarily create phab.wikivoyage.org" [dns] - 10https://gerrit.wikimedia.org/r/238198 [17:49:43] (03CR) 10BBlack: [C: 032 V: 032] Revert "Temporarily create phab.wikivoyage.org" [dns] - 10https://gerrit.wikimedia.org/r/238198 (owner: 10BBlack) [17:49:56] (03PS1) 10BBlack: Revert "add phab.wikidata.org temporarily T112381" [dns] - 10https://gerrit.wikimedia.org/r/238199 [17:50:30] (03PS2) 10BBlack: Revert "add phab.wikidata.org temporarily T112381" [dns] - 10https://gerrit.wikimedia.org/r/238199 [17:50:41] (03PS1) 10BBlack: Revert "misc-web: temporarily broaden user content domain match for phab" [puppet] - 10https://gerrit.wikimedia.org/r/238200 [17:50:46] (03PS2) 10BBlack: Revert "misc-web: temporarily broaden user content domain match for phab" [puppet] - 10https://gerrit.wikimedia.org/r/238200 [17:53:31] 6operations, 5Patch-For-Review: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1638290 (10BBlack) The reason we avoided wikimedia.org is the same reason this domain exists at all: it's a known security problem if phab loads 
user-defined content f... [17:54:51] (03CR) 10BBlack: [C: 032] Revert "add phab.wikidata.org temporarily T112381" [dns] - 10https://gerrit.wikimedia.org/r/238199 (owner: 10BBlack) [17:55:56] (03CR) 10BBlack: [C: 032] Revert "misc-web: temporarily broaden user content domain match for phab" [puppet] - 10https://gerrit.wikimedia.org/r/238200 (owner: 10BBlack) [17:56:37] RECOVERY - Disk space on mw1006 is OK: DISK OK [17:57:05] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up [17:57:17] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [17:57:26] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [17:57:36] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:57:36] RECOVERY - DPKG on mw1006 is OK: All packages OK [17:57:46] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [17:57:56] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [17:58:06] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:58:06] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed [18:03:43] 6operations, 5Patch-For-Review: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1638349 (10BBlack) Everything's reverted back to a normal state on the new cert now, with the exception of: https://gerrit.wikimedia.org/r/#/c/238195/ (RSA-only) whi... [18:05:20] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638365 (10mmodell) jcsrespo: I see. I don't have a very good understanding of mysql blob performance. I would have assumed that it handles large blobs fairly wel... 
[18:05:26] !log rebuilding restbase-test2001.codfw (nodetool rebuild -- eqiad) [18:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:06:05] (03PS3) 10Tim Landscheidt: Tools: Migrate from labsdebrepo to aptly [puppet] - 10https://gerrit.wikimedia.org/r/238089 (https://phabricator.wikimedia.org/T111708) [18:06:07] (03PS1) 10Yuvipanda: aptly: Pin per-project aptly repository [puppet] - 10https://gerrit.wikimedia.org/r/238201 [18:06:49] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638376 (10jcrespo) > I would have assumed that it handles large blobs fairly well. It generally does, I wonder if large data movement can cause stalls, because... [18:07:23] (03CR) 10Yuvipanda: "I76b53c1073cbd007107ad8f60512f201a6583d31 does the pinning at the source. I've also applied the aptly::client role to all instances via Hi" [puppet] - 10https://gerrit.wikimedia.org/r/238089 (https://phabricator.wikimedia.org/T111708) (owner: 10Tim Landscheidt) [18:07:46] (03PS1) 10BBlack: Add new ECDSA cert for wmfusercontent [puppet] - 10https://gerrit.wikimedia.org/r/238202 [18:08:23] (03CR) 10BBlack: [C: 032 V: 032] Add new ECDSA cert for wmfusercontent [puppet] - 10https://gerrit.wikimedia.org/r/238202 (owner: 10BBlack) [18:09:04] (03PS1) 10BBlack: Revert "switch wmfusercontent.org to RSA-only temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/238203 [18:09:09] (03PS2) 10BBlack: Revert "switch wmfusercontent.org to RSA-only temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/238203 [18:09:16] (03CR) 10BBlack: [C: 032 V: 032] Revert "switch wmfusercontent.org to RSA-only temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/238203 (owner: 10BBlack) [18:10:15] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638392 (10mmodell) There is quite a lot 
of activity on repositories but I didn't think the volume had changed very much in the past several weeks. There has bee... [18:10:39] (03PS2) 10Yuvipanda: labs_lvm: Only run extend-instance-vol when needed [puppet] - 10https://gerrit.wikimedia.org/r/235642 (https://phabricator.wikimedia.org/T109933) (owner: 10Tim Landscheidt) [18:11:00] (03CR) 10Yuvipanda: [C: 032 V: 032] labs_lvm: Only run extend-instance-vol when needed [puppet] - 10https://gerrit.wikimedia.org/r/235642 (https://phabricator.wikimedia.org/T109933) (owner: 10Tim Landscheidt) [18:11:57] 6operations, 5Patch-For-Review: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1638396 (10BBlack) 5Open>3Resolved a:3BBlack The ECDSA re-issue was much quicker than expected (must be automated now for simple cases), so the last bit is rever... [18:12:06] (03PS2) 10Yuvipanda: Tools: Accept mail for all submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/237863 (https://phabricator.wikimedia.org/T63484) (owner: 10Tim Landscheidt) [18:12:13] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Accept mail for all submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/237863 (https://phabricator.wikimedia.org/T63484) (owner: 10Tim Landscheidt) [18:14:06] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638411 (10chasemp) Just for historical perspective, when we first implemented we knew that MySQL was a stopgap store (for how long we didn't know), and decided t... [18:15:08] Phabricator still is unstyled for me, stupid cache I guess? 
:( [18:15:25] Firefox says sec_error_ocsp_unknown_cert for https://phab.wmfusercontent.org/ [18:16:17] oh OCSP heh [18:16:25] it would fix itself within an hour, but I'll push it around faster [18:16:40] (ocs updater needs to re-run for adding back the ECDSA cert) [18:17:05] try again, should be fixed [18:17:20] bblack: yep, thanks [18:18:06] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: puppet fail [18:21:03] 6operations: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638425 (10EBernhardson) 3NEW [18:21:31] i want to request having two servers in eqiad moved to different racks, should i email ops, or assign ticket to someone in particular, or how would i go about that? [18:22:52] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638437 (10JohnLewis) [18:23:03] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638441 (10EBernhardson) [18:23:26] ebernhardson: I added the dc project and CC'd Chris to the task. he (or someone) should pick it up and ask him to look at it :) [18:23:37] cmjohnson: ^ that also works since he is here :) [18:24:01] JohnFLewis: thanks [18:25:37] (03Abandoned) 10Thcipriani: Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/235385 (owner: 10Thcipriani) [18:28:22] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638476 (10Cmjohnson) Any particular servers you would like move or just take 2 that makes the most sense? I am thinking elastic1031 => A3 elastic103... [18:28:36] johnflewis ^ [18:29:36] ebernhardson: ^ enjoy the discussion! 
:) [18:30:16] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [18:30:32] (03PS2) 1020after4: SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) [18:31:25] (03PS3) 1020after4: SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) [18:31:53] (03CR) 1020after4: "I've addressed chad's concern and rebased on current production branch." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [18:32:03] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638480 (10EBernhardson) In terms of exact servers, whichever makes the most sense.I would like to see servers moved into both A and C racks for availab... [18:32:20] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638482 (10chasemp) >>! In T112559#1638476, @Cmjohnson wrote: > Any particular servers you would like move or just take 2 that makes the most sense? >... 
[18:33:32] (03CR) 1020after4: [C: 031] Phab (labs): Move sshd to 2222, easier to remember than 222 [puppet] - 10https://gerrit.wikimedia.org/r/235777 (owner: 10Chad) [18:33:45] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test web site in alternative language returned the unexpected status 520 (expecting: 200) [18:33:59] <_joe_> uh [18:34:47] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test web site in alternative language returned the unexpected status 520 (expecting: 200) [18:34:59] (03CR) 10Hashar: nodepool: sudo rules for contint-admins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) (owner: 10Hashar) [18:35:21] (03CR) 1020after4: [C: 031] Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [18:35:26] (03PS3) 10Rush: Phab (labs): Move sshd to 2222, easier to remember than 222 [puppet] - 10https://gerrit.wikimedia.org/r/235777 (owner: 10Chad) [18:36:37] (03PS3) 10Hashar: nodepool: sudo rules for contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) [18:36:44] (03PS2) 10Rush: Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [18:39:19] (03PS1) 10Nemo bis: [English Planet] Fetch all Magnus Manske posts [puppet] - 10https://gerrit.wikimedia.org/r/238207 [18:42:00] (03PS1) 10Thcipriani: Add pattern-matching arg to limit deploy hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/238208 [18:43:56] (03PS3) 10Ori.livneh: Slightly increase RESTBase job runner concurrency [puppet] - 10https://gerrit.wikimedia.org/r/237868 (owner: 10GWicke) [18:44:07] (03CR) 10Ori.livneh: [C: 032 V: 032] Slightly increase RESTBase job runner concurrency [puppet] - 10https://gerrit.wikimedia.org/r/237868 (owner: 10GWicke) [18:44:42] (03PS2) 10Ori.livneh: 
[English Planet] Fetch all Magnus Manske posts [puppet] - 10https://gerrit.wikimedia.org/r/238207 (owner: 10Nemo bis) [18:45:03] (03CR) 10Ori.livneh: [C: 032 V: 032] "Magnus is Wiki." [puppet] - 10https://gerrit.wikimedia.org/r/238207 (owner: 10Nemo bis) [18:45:07] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:46:35] ori, thanks re 37868! [18:47:46] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638593 (10jcrespo) Well, we have the closest thing to amazon s3, which is Swift... I do not think MySQL is a great place to store large files. Any relational da... [18:47:47] (03PS1) 10Thcipriani: Add --environment flag to cli.Application [tools/scap] - 10https://gerrit.wikimedia.org/r/238211 [18:48:00] (03CR) 10Ori.livneh: [C: 04-1] "Looks good -- but while you're here, could you make the names uniform, so it's not "adminpass" / "admin_pass", "adminuser" / "admin_user"," [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [18:51:46] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [18:52:26] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [18:54:48] (03CR) 10Kaldari: "Nemo_bis, TimStarling: What would be the best place to point people to from this code so that they can get an understanding of the history" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237960 (owner: 10Kaldari) [18:57:38] jouncebot: next [18:57:38] In 1 hour(s) and 2 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T2000) [19:00:28] (03PS1) 10Thcipriani: Allow full path to hosts file [tools/scap] - 10https://gerrit.wikimedia.org/r/238213 [19:04:13] (03PS1) 10Andrew Bogott: toolschecker: rename a test to actually reflect what it does. 
[puppet] - 10https://gerrit.wikimedia.org/r/238215 [19:04:14] (03PS1) 10Andrew Bogott: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) [19:05:22] (03PS2) 10Andrew Bogott: toolschecker: rename a test to actually reflect what it does. [puppet] - 10https://gerrit.wikimedia.org/r/238215 [19:05:27] (03PS2) 10Andrew Bogott: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) [19:06:47] (03CR) 10Andrew Bogott: [C: 032] toolschecker: rename a test to actually reflect what it does. [puppet] - 10https://gerrit.wikimedia.org/r/238215 (owner: 10Andrew Bogott) [19:09:24] (03CR) 10Zfilipin: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [19:09:33] (03PS3) 10Andrew Bogott: toolschecker: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) [19:09:45] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [19:10:31] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/225238 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin) [19:12:01] (03CR) 10Gilles: "https://grafana.wikimedia.org/#/dashboard/db/resourceloader" [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles) [19:13:07] (03PS4) 10Andrew Bogott: toolschecker: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) [19:13:51] 6operations, 6Labs, 10Labs-Infrastructure: move labs role classes to role/labs/foo structure - https://phabricator.wikimedia.org/T112570#1638664 (10Dzahn) 3NEW [19:19:07] (03PS5) 10Andrew Bogott: toolschecker: Added an ldap test. 
[puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) [19:20:25] (03PS1) 10Hashar: Turn puppet autosign back on beta/integration [puppet] - 10https://gerrit.wikimedia.org/r/238221 (https://phabricator.wikimedia.org/T112537) [19:21:18] (03CR) 10Andrew Bogott: [C: 032] toolschecker: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) (owner: 10Andrew Bogott) [19:22:13] (03CR) 10Nemo bis: "Usually our code comments only link the original request, from which one has to follow links. Maybe just add the mailing list link without" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237960 (owner: 10Kaldari) [19:23:49] 7Puppet, 6Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1638699 (10Dzahn) [19:23:50] 6operations, 6Labs, 10Labs-Infrastructure: move labs role classes to role/labs/foo structure - https://phabricator.wikimedia.org/T112570#1638698 (10Dzahn) [19:25:40] !log legoktm@tin Synchronized php-1.26wmf22/extensions/Echo/: Only load nojs Special:Notifications styles on the special page (duration: 00m 12s) [19:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:09] (03PS5) 10Dzahn: phragile: Add role class [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T108803) (owner: 10WMDE-leszek) [19:26:27] (03CR) 10Rush: [C: 031] "seems good, some of the admin_pass weird naming is meant to reflect the weird naming in ops/private I think. Not that it necessarily mean" [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [19:26:51] greg-g, thcipriani, twentyafterfour, ostriches: FYI, I've asked the collab team to push out fixes for T112401 ASAP, so they'll probably do sync-dir/files throughout the day. 
[19:27:31] I don't have any deployments today [19:27:37] (03PS6) 10Dzahn: phragile: Add role class [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T108803) (owner: 10WMDE-leszek) [19:28:48] 6operations, 7Database: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#1638724 (10chasemp) 3NEW a:3jcrespo [19:29:49] (03PS1) 10Andrew Bogott: toolschecker: s/labss/labs [puppet] - 10https://gerrit.wikimedia.org/r/238223 [19:30:37] 6operations, 7Database: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#1638754 (10chasemp) see: https://gerrit.wikimedia.org/r/#/c/235778/1 for related cleanup [19:30:51] (03CR) 10Andrew Bogott: [C: 032] toolschecker: s/labss/labs [puppet] - 10https://gerrit.wikimedia.org/r/238223 (owner: 10Andrew Bogott) [19:31:27] (03CR) 10Dzahn: [C: 032] "amended to be called "labsphragile" rather than "phragile". i also don't think it looks great but that is the current standard today if yo" [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T108803) (owner: 10WMDE-leszek) [19:32:04] (03PS2) 10Andrew Bogott: toolschecker: s/labss/labs [puppet] - 10https://gerrit.wikimedia.org/r/238223 [19:35:45] PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 1 failures [19:36:54] (03CR) 1020after4: [C: 031] Add pattern-matching arg to limit deploy hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/238208 (owner: 10Thcipriani) [19:40:23] (03CR) 10Ori.livneh: Basic role for Sentry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [19:40:45] PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 1 failures [19:41:39] (03CR) 1020after4: [C: 032] Allow full path to hosts file [tools/scap] - 10https://gerrit.wikimedia.org/r/238213 (owner: 10Thcipriani) [19:42:05] (03PS4) 10Rush: Phab (labs): Move sshd to 2222, easier to remember than 
222 [puppet] - https://gerrit.wikimedia.org/r/235777 (owner: Chad)
[19:42:16] (CR) Rush: [C: 2 V: 2] "no objection here" [puppet] - https://gerrit.wikimedia.org/r/235777 (owner: Chad)
[19:43:49] gilles: so, revert https://gerrit.wikimedia.org/r/#/c/234157/ to unbreak?
[19:43:53] I can push that through right now
[19:44:19] unless you have some other plan or simple fix
[19:45:07] operations, ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638827 (chasemp) For the moment none of these three can be missing at the same time: hieradata/hosts/elastic1001.yaml:elasticsearch::master_eligible...
[19:45:17] bblack: let me take a quick look
[19:45:45] RECOVERY - check_puppetrun on beryllium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[19:46:16] ok
[19:48:51] bblack: i got it, fix incoming
[19:49:13] operations, hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1638833 (Ottomata) The specs of those are all the same. We'll use - analytics1011 - analytics1016 - analytics1019 These will be reinstalled with Jessie and renamed. The current node names...
[19:50:20] (PS3) Dzahn: admin: add kartik to apertium-admins [puppet] - https://gerrit.wikimedia.org/r/235854 (https://phabricator.wikimedia.org/T111360)
[19:51:03] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638840 (mmodell) So we could potentially have phabricator store its files in swift?
[19:52:48] (PS1) Ori.livneh: Follow-up for I8be1929b2: remove mediawiki::monitoring::webserver [puppet] - https://gerrit.wikimedia.org/r/238237
[19:52:50] (PS1) Ori.livneh: Follow-up for Iae36f1: actually provision the varnishprocessor module [puppet] - https://gerrit.wikimedia.org/r/238238
[19:53:03] (CR) Ori.livneh: [C: 2 V: 2] Follow-up for I8be1929b2: remove mediawiki::monitoring::webserver [puppet] - https://gerrit.wikimedia.org/r/238237 (owner: Ori.livneh)
[19:54:09] Ops-Access-Requests, operations, Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1638856 (Dzahn) a:Dzahn
[19:55:14] (CR) Gilles: [C: 1] "Duh" [puppet] - https://gerrit.wikimedia.org/r/238238 (owner: Ori.livneh)
[19:56:01] (CR) Ori.livneh: [C: 2] Follow-up for Iae36f1: actually provision the varnishprocessor module [puppet] - https://gerrit.wikimedia.org/r/238238 (owner: Ori.livneh)
[19:57:10] Ops-Access-Requests, operations, Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1638872 (Dzahn) @KartikMistry I merged the change now since it was approved. It adds you to the group and the group to all nodes with the "sca" role. members: [...
[19:57:37] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:57:55] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:06] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:07] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:16] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:16] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:16] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:31] Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1638874 (Dzahn) a:Dzahn
[19:58:45] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:46] PROBLEM - puppet last run on mw2033 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:59:07] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:59:07] PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:00:02] Puppet, Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1638882 (scfc) In https://gerrit.wikimedia.org/r/#/c/230928/1/manifests/role/labsvagrantlxc.pp, @akosiaris wrote that roles should move to `modules/role/manifests/` in the long term, so ideally...
[20:00:04] gwicke cscott arlolra subbu mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T2000). Please do the needful.
[20:00:29] ori: the crits above are related to removing the hhvm monitoring thingy
[20:00:58] (might be race conditions though)
[20:01:01] bblack: yes, but they are for hosts that were mid-run
[20:01:01] yeah
[20:01:29] ok
[20:01:50] bblack: the varnish thing should be fixed as soon as that patch rolls out to all varnishes
[20:02:09] i forced a run on cp1048 and it worked correctly
[20:02:22] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1638888 (Mike_Peel) >>! In T104735#1632982, @Ricordisamoa wrote: >>>! In T104735#1632583, @BBlack wrote: >> This seems to be satisfy pointless curiosity of users who look at a browser developer console and...
[20:03:02] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638894 (chasemp) >>! In T109279#1638840, @mmodell wrote: > So we could potentially have phabricator store its files in swift? 1. Yes, but not sure how much w...
[20:03:16] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:03:57] Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1638900 (Dzahn) In the ops meeting it has been said that this is approved in principle, but that we don't want to use sy...
[20:04:11] starting parsoid deploy
[20:04:35] Ops-Access-Requests, operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638901 (Dzahn) a:Dzahn
[20:05:49] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638902 (mmodell) @chasemp: upstream phabricator is almost certainly storing things in s3. 1. Swift should be similar enough to s3 to make it an easy integrati...
[20:07:53] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638915 (chasemp) I think there is some difference in API for swift vs s3, as in swift has a subset. paging @fgiunchedi who should know offhand.
[20:10:49] (PS1) Dzahn: admins: add tfinc to elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473)
[20:11:38] (CR) jenkins-bot: [V: -1] admins: add tfinc to elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473) (owner: Dzahn)
[20:12:08] operations, ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638926 (chasemp) Also I think this will need to be updated at this time: hieradata/regex.yaml es_rack_a3: __regex: !ruby/regexp /^elastic100[0-6]...
[20:14:22] ori: thanks for the fix
[20:15:02] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1638960 (BBlack) The problem here is with people's perceptions mostly :/ It's a common pattern to use multiple domainnames to fetch sub-resources of a site. Aside from the obvious examples like gstatic, e...
[20:15:14] !log deployed parsoid sha 3d5f4359
[20:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:16:13] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638967 (Tfinc) @Dzahn, looks like you this patch set adds "tfinc" instead of "tomasz" I'd mention that in CR but gerrit is not letting me in
[20:16:55] operations, ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5.
- https://phabricator.wikimedia.org/T112559#1638971 (chasemp)
[20:18:06] (PS2) Dzahn: admins: add Tomasz Finc to elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473)
[20:18:44] gilles: is the varnishmedia data starting to come in?
[20:18:45] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638977 (Dzahn) @Tfinc ah, thanks, fixed! Do we need to investigate the Gerrit issue?
[20:20:13] (PS3) Dzahn: admins: add Tomasz Finc to elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473)
[20:21:00] !log graceful’d apache2 on labcontrol1001
[20:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:22:08] (CR) Dzahn: [C: 2] admins: add Tomasz Finc to elasticsearch-roots [puppet] - https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473) (owner: Dzahn)
[20:22:24] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638986 (Krenair) Gerrit should be letting you in using your wikitech credentials. It's working for me...
[20:22:46] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638996 (Dzahn) Approved in ops meeting today. Merged.
[20:23:14] Ori: just walked out but yes I saw the data come in. One server's worth, cp1048 I presume
[20:23:16] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:23:40] anyone know why puppet is disabled on elasticsearch?
[20:23:41] Ori: I've added a graph for it to the media dashboard
[20:23:48] administratively disabled (Reason: 'reason not specified');
[20:24:02] can't add new admins without :)
[20:24:16] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[20:24:18] please add reason.. or it's a bug that it's disabled
[20:24:35] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[20:24:45] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[20:24:56] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[20:25:04] mutante: if it's 1001 I added a reason as a second command I must not not work out
[20:25:06] my apologies
[20:25:25] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:25:26] RECOVERY - puppet last run on mw2033 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[20:25:28] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639009 (Mike_Peel) >>! In T104735#1638960, @BBlack wrote: > The problem here is with people's perceptions mostly :/ It's a common pattern to use multiple domainnames to fetch sub-resources of a site. Asi...
[20:25:40] chasemp: ah, ok, yes it is 1001
[20:25:47] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:25:54] chasemp: should i wait? i'm adding Tomasz, just did on 1002
[20:25:56] I ran it, then realized and ran again. that seems not to actually work.
[20:25:56] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:26:01] nope I just fixed it up
[20:26:07] ok, thanks
[20:26:22] (CR) Gergő Tisza: Basic role for Sentry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[20:26:45] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:28:00] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1639017 (Dzahn) @Tfinc i saw puppet add your user on elastic1001 and 1002. The other 2 will just follow automatically. Since you already have a shell user and bastion...
[20:28:21] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639019 (BBlack) >>! In T104735#1639009, @Mike_Peel wrote: >>>! In T104735#1638960, @BBlack wrote: >> The problem here is with people's perceptions mostly :/ It's a common pattern to use multiple domainnam...
[20:28:52] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1639021 (Dzahn) Open>Resolved @elastic1001:~# id tomasz uid=1155(tomasz) gid=500(wikidev) groups=500(wikidev),709(elasticsearch-roots)
[20:29:41] kart_: you should have access to control apertium-apy now
[20:30:01] tfinc: and you have the elastic access now
[20:30:14] mutante: thank you
[20:30:40] mutante: did you see my note about "tfinc" vs "tomasz" in the phab task ?
[20:30:43] yw, let us know if any issues with gerrit
[20:30:46] yes, i did
[20:30:55] i changed it to "tomasz"
[20:31:18] and confirmed you are in the elasticsearch-roots group
[20:32:15] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639045 (Mike_Peel) >>!
In T104735#1639019, @BBlack wrote: > Yeah but if you look closer, that example (en + bits) doesn't share anything but the trailing `.org`. It's wiki**P**edia vs wiki**M**edia. Simi...
[20:35:27] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:36:10] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639062 (BBlack) But now we're off in the territory of human comfort levels, not software. It's still meaningless for any real verification to populate non-existent related hostnames just for people to loo...
[20:36:52] Ops-Access-Requests, operations, Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1639063 (Dzahn) Open>Resolved @KartikMistry please reopen if any issues, but i don't expect any because it's the same we do for other services in "sca".
[20:37:34] Ops-Access-Requests, operations: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1639065 (Dzahn) a:jcrespo>Dzahn
[20:38:16] Ops-Access-Requests, operations: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1639067 (Dzahn)
[20:41:49] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639085 (Mike_Peel) >>! In T104735#1639062, @BBlack wrote: > But now we're off in the territory of human comfort levels, not software. It's still meaningless for any real verification to populate non-exist...
[20:45:41] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639106 (BBlack) >>! In T104735#1639085, @Mike_Peel wrote: >>>! In T104735#1639062, @BBlack wrote: >> But now we're off in the territory of human comfort levels, not software. It's still meaningless for an...
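The PROBLEM/RECOVERY pairs that the monitoring bot posted for the puppet checks above can be matched up mechanically to see how long each alert stayed open. The following is only a sketch: it assumes one message per line and that all timestamps fall on the same day, and the function name `alert_durations` and the regular expression are illustrative, not part of any Icinga or bot tooling. The four sample lines are copied from the log.

```python
import re
from datetime import datetime

# Four alert lines copied from the log (one message per line is assumed here).
SAMPLE = """\
[19:57:37] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:07] PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:03:16] RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:25:56] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
"""

# Hypothetical pattern for "[HH:MM:SS] PROBLEM|RECOVERY - <check> on <host> is ...".
ALERT = re.compile(r"^\[(\d\d:\d\d:\d\d)\] (PROBLEM|RECOVERY) - (.+?) on (\S+) is ")


def alert_durations(text):
    """Return {(check, host): seconds between first PROBLEM and next RECOVERY}."""
    open_alerts = {}
    durations = {}
    for line in text.splitlines():
        match = ALERT.match(line)
        if not match:
            continue  # chat messages and other bot output are ignored
        stamp, kind, check, host = match.groups()
        when = datetime.strptime(stamp, "%H:%M:%S")
        key = (check, host)
        if kind == "PROBLEM":
            # Keep the first PROBLEM if the same alert is repeated.
            open_alerts.setdefault(key, when)
        elif key in open_alerts:
            durations[key] = int((when - open_alerts.pop(key)).total_seconds())
    return durations


if __name__ == "__main__":
    for (check, host), seconds in sorted(alert_durations(SAMPLE).items()):
        print(f"{host}: {check} open for {seconds} s")
```

A log that crosses midnight would need the date as well as the time, which this sketch does not handle.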
[20:47:07] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639116 (Mike_Peel) >>! In T104735#1639106, @BBlack wrote: >>>! In T104735#1639085, @Mike_Peel wrote: >>>>! In T104735#1639062, @BBlack wrote: >>> But now we're off in the territory of human comfort levels,...
[20:49:48] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639132 (BBlack) Why would anyone go there? That hostname doesn't exist, and has never been linked anywhere.
[20:53:13] !log updated OCG to version 5811056e28f2bc6408b6da96095352ab381bb11f
[20:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:54:26] Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639162 (hashar) Thanks @Dzahn, I followed up on @akosiaris comment and adjust the sudo rule to use service instead of s...
[20:56:10] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639166 (Mike_Peel) We seem to have come full circle... I'm still not convinced that it's worth spending more time discussing this than it would take to fix the issue!
[20:59:20] !log legoktm@tin Synchronized php-1.26wmf22/extensions/Echo/: Hack around OOUI's icon pack being too large by creating our own (duration: 00m 12s)
[20:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:01:51] (PS2) Rush: phab: use permissions for files on bot upload [puppet] - https://gerrit.wikimedia.org/r/236205
[21:01:58] (CR) Rush: [C: 2 V: 2] phab: use permissions for files on bot upload [puppet] - https://gerrit.wikimedia.org/r/236205 (owner: Rush)
[21:02:16] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:05:00] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639200 (BBlack) These are the kinds of things I've had to deal with over the past several months, most of which could've been avoided by and are a part of this larger philosophical problem, IMHO: T101048...
[21:05:10] (CR) Tim Landscheidt: [C: 1] aptly: Pin per-project aptly repository [puppet] - https://gerrit.wikimedia.org/r/238201 (owner: Yuvipanda)
[21:08:43] (CR) Milimetric: "Quick update: we're finalizing the name of the three servers that have been allocated to this (see https://phabricator.wikimedia.org/T1110" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[21:09:06] (PS4) Dzahn: nodepool: sudo rules for contint-admins [puppet] - https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) (owner: Hashar)
[21:10:13] Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639211 (Dzahn) @hashar looks all good. thanks. i will merge. i also saw on labnodepool1001 "nodepool" is now recognized...
[21:10:33] (CR) Dzahn: [C: 2] nodepool: sudo rules for contint-admins [puppet] - https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) (owner: Hashar)
[21:12:13] disabled (Reason: 'Andrew disabling puppet because nodepool is running amok.');
[21:12:36] andrewbogott: ^ do we know if it can be enabled again? is that fresh or old?
[21:12:54] i just want a single run to apply sudo changes
[21:13:21] mutante: I disabled because hashar asked me to...
[21:13:24] A single run is definitely fine
[21:13:37] heh, ok, so this is for his own access :)
[21:13:41] If the nodepool service is running then my reason for disabling is moot
[21:13:49] so you can leave it enabled if nodepool is already up
[21:14:04] it does not seem to be running
[21:14:06] hashar: hi
[21:14:31] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639223 (hashar) There is nothing requiring http on wmfusercontent.org and I am not sure what would be the use case. Since the whole domain can host any arbitrary file (per design) and is solely used for i...
[21:14:37] i will run it once and update on ticket
[21:14:42] bblack: giving you some support :-}
[21:14:57] mutante: andrewbogott you can reenable puppet / nodepool -}
[21:15:09] hashar: ok, thanks
[21:15:11] I got it killed on thursday iirc , because I thought it could kill labs
[21:15:19] ok, i'm doing it
[21:15:29] but labs is fully operational and I tried nodepoold again today and it is all fine as far as I am concerned
[21:15:36] (PS1) Andrew Bogott: toolschecker: Added db tests [puppet] - https://gerrit.wikimedia.org/r/238323 (https://phabricator.wikimedia.org/T107449)
[21:15:48] !log labnodepool1001 - re-enable puppet and nodepool
[21:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:15:57] it might create a snapshot image over night and maybe deleted / create an instance
[21:15:57] !log ori@tin Synchronized php-1.26wmf22/extensions/TitleBlacklist: Ie44fcb500: Avoid checking blacklists in isBlacklisted() for existing titles (duration: 00m 12s)
[21:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:16:08] it's running now
[21:16:11] \O/
[21:16:29] andrewbogott: I got some doc work ongoing for nodepool :}
[21:16:53] hashar: %contint-admins ALL = NOPASSWD: /usr/sbin/service nodepool start
[21:16:58] the new sudo rules are there
[21:17:22] \O/
[21:17:38] eh, how many nodes are there..looks
[21:17:57] (PS2) Andrew Bogott: toolschecker: Added db tests [puppet] - https://gerrit.wikimedia.org/r/238323 (https://phabricator.wikimedia.org/T107449)
[21:17:59] only 1 - 5 :-D
[21:18:02] just one. ok.
then this is done :)
[21:18:11] and you guys can also start and stop it now
[21:18:20] confirmed
[21:18:22] \O/
[21:18:25] :)
[21:19:04] Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639253 (Dzahn) 14:17 < mutante> !log labnodepool1001 - re-enable puppet and nodepool 14:19 < mutante> hashar: %contint...
[21:19:13] Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639254 (Dzahn) Open>Resolved
[21:20:51] Ops-Access-Requests, operations, Continuous-Integration-Scaling: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1602573 (Dzahn)
[21:23:23] Ops-Access-Requests, operations, Continuous-Integration-Scaling: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639264 (hashar) Thanks.
Wrote some lame notes on the wiki https://wikitech.wikimedia.org/w/index.php?title=Nodepool&diff=177626&oldid=177201
[21:24:35] sleeep time
[21:26:03] hashar: bonne nuit
[21:32:55] (PS2) Yuvipanda: aptly: Pin per-project aptly repository [puppet] - https://gerrit.wikimedia.org/r/238201
[21:33:01] (CR) Yuvipanda: [C: 2 V: 2] aptly: Pin per-project aptly repository [puppet] - https://gerrit.wikimedia.org/r/238201 (owner: Yuvipanda)
[21:34:14] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1639306 (mmodell) we wouldn't be the only ones using swift: https://secure.phabricator.com/T5843
[21:43:27] (PS1) Dzahn: admin: create shell account for Joshua Minor [puppet] - https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872)
[21:44:12] (CR) Dzahn: [C: -2] "needs labs user for UID" [puppet] - https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872) (owner: Dzahn)
[21:48:34] (PS4) Yuvipanda: Tools: Migrate from labsdebrepo to aptly [puppet] - https://gerrit.wikimedia.org/r/238089 (https://phabricator.wikimedia.org/T111708) (owner: Tim Landscheidt)
[21:49:12] (CR) Yuvipanda: [C: 2 V: 2] Tools: Migrate from labsdebrepo to aptly [puppet] - https://gerrit.wikimedia.org/r/238089 (https://phabricator.wikimedia.org/T111708) (owner: Tim Landscheidt)
[21:49:51] (PS6) EBernhardson: Disable dynamic scripting in Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles)
[21:52:39] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1639372 (Dzahn) @JMinor Hi, i'm Daniel, i'm going to follow-up with this ticket to get you the access now that all requirements are done. Just...
[21:58:19] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639409 (Dzahn) @Robh did we have an outcome today?
[21:59:51] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639413 (RobH) Someone said it was approved in the meeting notes, but since I wasn't on clinic du...
[22:01:19] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639416 (Dzahn) Ok, thanks, i'll take it then since i'm on duty and just did the other contint-ad...
[22:01:49] mutante: wait
[22:01:51] i may be incorrect
[22:01:55] i just went to doublecheck what i said
[22:02:00] and i may be quickly reverting.
[22:02:47] robh: ok
[22:02:55] it looks like it wasnt on the meeting this week
[22:03:02] so whoever was on clinic last week missed it i suppose
[22:03:49] heh, hashar said ". So should be talked about again in the next ops meeting on Monday Sep. 25th."
[22:03:50] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639418 (RobH) I was incorrect. That was a different task (Checking notes on https://office.wik...
[22:03:55] but that's a Friday :)
[22:04:08] why is it assigned to me?
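The "Monday Sep. 25th" remark can be checked with a few lines of date arithmetic. This is a small sketch that assumes the year 2015 and the log date 2015-09-14, both taken from the deployment calendar link earlier in this log; the helper name `weekday_name` is made up for illustration. It also lists the two Mondays following the log date, without claiming which one was actually meant.

```python
from datetime import date, timedelta

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def weekday_name(day):
    """Return the English weekday name for a datetime.date."""
    return DAY_NAMES[day.weekday()]


log_day = date(2015, 9, 14)      # assumed date of this log
quoted_day = date(2015, 9, 25)   # the "Sep. 25th" quoted in the channel

# The next two Mondays after the log date, as candidate meeting days.
next_mondays = [log_day + timedelta(weeks=k) for k in (1, 2)]

if __name__ == "__main__":
    print(log_day.isoformat(), weekday_name(log_day))
    print(quoted_day.isoformat(), weekday_name(quoted_day))
    for monday in next_mondays:
        print(monday.isoformat(), weekday_name(monday))
```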
[22:04:16] you took it :)
[22:04:22] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639419 (RobH) a:RobH>None
[22:04:28] "If this is approved in the meeting, I'll merge the patchset post meeting."
[22:04:31] oh well, putting back up for grabs, heh
[22:04:39] yes, but then the meeting didnt happen so opps =]
[22:05:37] ok
[22:05:46] also, Sep 25th must be a mistake
[22:06:25] mutante: it's a Friday. Of course it is a mistake
[22:06:35] [unaware of the context]
[22:06:46] 15:06 < mutante> but that's a Friday :)
[22:09:02] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639428 (Dzahn) a:Dzahn
[22:09:13] operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639430 (Platonides) >>! In T104735#1639062, @BBlack wrote: > But now we're off in the territory of human comfort levels, not software. It's still meaningless for any real verification to populate non-exis...
[22:21:51] Ops-Access-Requests, operations, Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1639464 (Dzahn) a:Dzahn
[22:22:38] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1639467 (Dzahn) a:Dzahn>JMinor please assign the ticket back to me when you're done or have a reply.
thank you
[22:23:30] Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639470 (Dzahn) Open>stalled
[22:26:07] (CR) EBernhardson: "the issues applying this to beta cluster are unrelated to this patch. I've filed T112585 to capture that problem." [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles)
[22:27:42] Blocked-on-Operations, Puppet, Reading-Infrastructure-Team, Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1639478 (Dzahn) @tgr and @ori have comments on the patch originally created by @gilles. I'm not sure if all concerns have been addressed ye...
[22:29:24] Blocked-on-Operations, Puppet, Reading-Infrastructure-Team, Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1639484 (Dzahn) also @akosiaris do you like it nowadays with the current PS?
[22:37:49] !log deployed tilerator
[22:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:42:24] (PS1) Dzahn: reprepro: add new distro jessie for mediawiki releases [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225)
[22:43:44] (CR) John F. Lewis: [C: 1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:44:30] (CR) Dzahn: "what about "'AlsoAcceptFor' => 'trusty'," ?" [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:46:23] (CR) Dzahn: "@filippo am i doing it right? what else do we need?" [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:47:31] (CR) John F.
Lewis: reprepro: add new distro jessie for mediawiki releases [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:48:02] operations, Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1639576 (Dzahn) @fgiunchedi ^ how does that change look to add jessie? I'm not sure what to put in "AlsoAcceptFor" and if we say 8.0, 8.1 or 8.2 jessie
[22:48:35] (CR) John F. Lewis: "Looks like it should be jessie but since I'm not confident and missed it first time, removing code review." [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:49:37] (PS2) Dzahn: reprepro: add new distro jessie for mediawiki releases [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225)
[22:51:35] operations, Flow, MediaWiki-Redirects, Collaboration-Team-Current, and 2 others: On mobile, the Flow notification's link takes you to the desktop version of the Flow page, even though the main (background) link takes you to the mobile one (main) - https://phabricator.wikimedia.org/T107108#1639601 (D...
[22:57:55] andrewbogott: can we start "nova-scheduler" on labcontrol1002 or should it be stopped
[22:59:37] * James_F waves for SWAT, pre-empting the bot.
[22:59:40] mutante: labcontrol1002 is a hot spare, it shouldn’t really be running anything. Is there a test for nova-scheduler?
[23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T2300). Please do the needful.
[23:00:04] James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:13] 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639636 (10BBlack) >>! In T104735#1639430, @Platonides wrote: >>>! In T104735#1639062, @BBlack wrote: >> But now we're off in the territory of human comfort levels, not software. It's still meaningless for a... [23:00:16] Krenair: is it going to be you? [23:00:29] not it [23:00:34] * RoanKattouw should go to sleep [23:00:56] andrewbogott: yes, i was checking icinga [23:01:15] mutante: hm, ok, that probably shouldn’t be tested, I need to purge that host [23:01:20] it started about 7h ago fwiw [23:01:43] ok, thanks, i'll just disable it for now [23:01:50] RoanKattouw, are you in NL? [23:01:52] then you can purge it later [23:02:07] James_F, guess so [23:02:57] mutante: thanks [23:03:43] ACKNOWLEDGEMENT - nova-scheduler process on labcontrol1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler daniel_zahn hot spare [23:03:45] added "persistent comment" -> "hot spare" [23:05:39] subbu: Yeah, for a week [23:05:47] James_F: i'll push it out i suppose [23:05:50] not seeing volunteers :) [23:06:22] ebernhardson: You and Krenair fight. :-) [23:06:27] ahh i see krenair got it, excellent. [23:08:59] 6operations, 10Traffic, 7HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#1639696 (10Dzahn) p:5Triage>3Normal [23:09:58] 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1639700 (10Dzahn) p:5Triage>3Normal [23:11:13] James_F, why are you naming MediaWiki-Gallery without the prefix? [23:12:13] Krenair: The prefix isn't meant to be there. [23:12:22] Why not? [23:12:23] Krenair: It's an artefact of the Bugzilla->Phabricator move. [23:12:35] It wasn't removed initially in case of clashes. [23:13:14] Why are we not keeping them? 
[23:13:30] Because they clutter the search space and are unhelpful. [23:13:40] Anyway, this is off-topic for -operations. :-) [23:13:50] Phabricator doesn't provide a hierarchy of projects [23:14:19] Yet. [23:14:24] So it should be prefixed as a part of MediaWiki [23:14:27] No. [23:14:51] This isn't the venue, and it's settled policy. If you want to change it, you can start a new discussion, but people probably will disagree. :-) [23:14:57] Link please [23:15:22] Krenair: Are you deploying or not? [23:15:43] 10Ops-Access-Requests, 6operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1639814 (10Dzahn) [23:15:57] I've got stuff to do, like fixing Commons. :-) [23:16:08] Send me the link James_F. [23:16:17] https://wikitech.wikimedia.org/wiki/Deployments [23:18:27] That's not the link I was looking for [23:18:55] Krenair: Do I need to find someone else to deploy? [23:19:09] Depends [23:19:16] * James_F sighs. [23:19:17] ebernhardson: Can you please deploy? Krenair seems to have lost interest and we've got wikis to fix. [23:19:31] what is this patch? [23:19:40] It's already live in master. [23:19:46] https://gerrit.wikimedia.org/r/238334 [23:20:08] (Fixes an issue with the UploadWizard, which is unhelpful for Commonists during WLM. :-)) [23:20:39] (03CR) 10John F. Lewis: [C: 04-1] "Looked at the ticket as I don't see any long term need for this personally. Though I don't object to an amend and re-evaluate." [puppet] - 10https://gerrit.wikimedia.org/r/237865 (https://phabricator.wikimedia.org/T83158) (owner: 10Dzahn) [23:20:47] i can't see any reason why not [23:23:19] !log ebernhardson@tin Synchronized php-1.26wmf22/extensions/UploadWizard/: Swat out badtoken fix to UploadWizard in 1.26wmf22 (duration: 00m 12s) [23:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:25] Thanks. [23:23:27] James_F: I misread that for Communists. whoops :) [23:23:29] * James_F tests. 
[23:23:32] JohnFLewis: ;-) [23:24:30] (03PS1) 10CSteipp: Enable captchas on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238357 (https://phabricator.wikimedia.org/T86460) [23:25:06] ebernhardson: Thanks, LGTM. [23:27:29] !log ebernhardson@tin Synchronized php-1.26wmf22/extensions/WikimediaEvents/: Change bucket selection methods in CompletionSuggestions AB test (duration: 00m 12s) [23:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:57] 6operations, 7HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1639855 (10Dzahn) p:5Triage>3Normal [23:28:47] 6operations, 10ops-eqiad: db1043 degraded RAID - https://phabricator.wikimedia.org/T112502#1639859 (10Dzahn) p:5Triage>3High [23:29:20] 6operations, 10Traffic: Deprecate pybal SSH health checks - https://phabricator.wikimedia.org/T111899#1639868 (10Dzahn) p:5Triage>3Normal [23:36:45] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:37:52] (03CR) 10EBernhardson: [C: 031] Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [23:39:48] sca1002 - NRPE running and All endpoints are healthy [23:40:36] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:40:46] Oy. [23:41:13] mutante: Is the test too sensitive? Or is the service actually flaky? [23:41:35] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [23:42:07] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [23:42:09] James_F: the service looks ok: [23:42:14] Yeah. 
[23:42:24] /usr/local/lib/nagios/plugins/service_checker -t 5 10.64.32.153 http://10.64.32.153:1970 [23:42:30] All endpoints are healthy [23:42:32] that is it [23:42:47] i believe it's icinga being too busy for a moment to get the result from NRPE within the timeout [23:43:52] to run it locally takes about 1.5 seconds, not 10 [23:44:11] but on the icinga side, there's lots and lots to run [23:44:51] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1639913 (10Tgr) [23:45:39] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (10Tgr) [23:46:09] (03PS3) 10Dzahn: Add scap scripts to all canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [23:47:23] * James_F nods. Thanks, mutante. [23:48:04] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/883/mw1017.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: 10Jcrespo) [23:48:49] 6operations, 5Patch-For-Review: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1639947 (10Dzahn) a:5jcrespo>3Dzahn [23:50:01] (03PS34) 10Gergő Tisza: Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [23:51:16] RECOVERY - nova-scheduler process on labcontrol1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-scheduler [23:54:04] 6operations, 5Patch-For-Review: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1639967 (10Dzahn) Sep 14 23:50:10 mw1017 puppet-agent[19262]: (/Stage[main]/Scap::Scripts/File[/usr/local/bin/mwscript]/content) content changed ``` Notice: /Stage[main]/Scap::Scripts/File[/usr/loc... 
[23:55:00] 6operations, 5Patch-For-Review: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1639968 (10Dzahn) 5Open>3Resolved [23:55:10] 6operations: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1628038 (10Dzahn) [23:56:16] PROBLEM - nova-scheduler process on labcontrol1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler [23:57:11] (03PS1) 10Tim Starling: Update personal .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/238363