[00:03:51] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1806943 (10jcrespo) @JanZerebecki Do not get me wrong, I think we should have the ability to do that. But how does a compromised key affect us? A compromised key or CA for HTTPS means th...
[01:12:10] (03PS2) 10Ori.livneh: Add redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253146 (https://phabricator.wikimedia.org/T100714)
[01:21:07] (03CR) 10Ori.livneh: [C: 032] wmflib: add conflicts() function [puppet] - 10https://gerrit.wikimedia.org/r/253145 (owner: 10Ori.livneh)
[01:27:27] (03PS3) 10Ori.livneh: Add redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253146 (https://phabricator.wikimedia.org/T100714)
[01:27:29] (03PS1) 10Ori.livneh: Fix predicate in conflicts() code [puppet] - 10https://gerrit.wikimedia.org/r/253286
[01:27:43] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix predicate in conflicts() code [puppet] - 10https://gerrit.wikimedia.org/r/253286 (owner: 10Ori.livneh)
[01:31:12] (03CR) 10Ori.livneh: [C: 032] Add redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253146 (https://phabricator.wikimedia.org/T100714) (owner: 10Ori.livneh)
[01:35:14] (03PS1) 10Ori.livneh: redis::instance: qualify resource reference [puppet] - 10https://gerrit.wikimedia.org/r/253288
[01:36:37] (03CR) 10Ori.livneh: [C: 032] redis::instance: qualify resource reference [puppet] - 10https://gerrit.wikimedia.org/r/253288 (owner: 10Ori.livneh)
[01:37:11] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: puppet fail
[01:44:01] (03PS1) 10Ori.livneh: redis::instance: omit empty values from template output [puppet] - 10https://gerrit.wikimedia.org/r/253289
[01:44:13] (03CR) 10Ori.livneh: [C: 032 V: 032] redis::instance: omit empty values from template output [puppet] - 10https://gerrit.wikimedia.org/r/253289 (owner: 10Ori.livneh)
[01:46:31] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:22:14] !log l10nupdate@tin Synchronized php-1.27.0-wmf.6/cache/l10n: l10nupdate for 1.27.0-wmf.6 (duration: 06m 29s)
[02:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:09] (03CR) 10Bmansurov: [C: 031] Disable first QuickSurveys survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253038 (https://phabricator.wikimedia.org/T118525) (owner: 10Jhobs)
[03:33:18] 6operations, 6Labs: Investigate why nscd is used in labs - https://phabricator.wikimedia.org/T100564#1807015 (10yuvipanda) 5Open>3declined Looks like we need it for caching.
[03:41:03] PROBLEM - puppet last run on mw2002 is CRITICAL: CRITICAL: puppet fail
[03:53:31] (03PS1) 10Yuvipanda: redis: Make redis::instance use base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/253294
[03:53:35] 7Blocked-on-Operations, 6operations, 7Availability, 5Patch-For-Review, 7Performance: Upstart support for redis::instance - https://phabricator.wikimedia.org/T118704#1807031 (10ori) 3NEW a:3yuvipanda
[03:55:06] ori: ^ noop patch
[03:57:13] ori: am writing another one with upstart support now
[03:57:48] (03PS2) 10Yuvipanda: redis: Make redis::instance use base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/253294 (https://phabricator.wikimedia.org/T118704)
[03:57:53] (removed more boilerplate)
[03:58:07] you should add an [Install] section
[03:59:40] ori: hmm, right.
[04:00:20] i am bummed about not using systemd templates, but for no real rational reason. you are right that this is more portable.
[04:01:05] I don't think we give up too much either
[04:01:10] also do we set daemonize?
[04:01:38] if we don't we don't need to expect fork I guess
[04:02:25] ah I see, so it uses the default
[04:02:29] I wonder what the default is
[04:02:31] * YuviPanda checks
[04:02:51] what
[04:02:54] the default is empty?!
[04:03:13] hmm
[04:03:15] it is
[04:03:17] that's strange
[04:06:30] (03PS1) 10Yuvipanda: redis: Add upstart support to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253295 (https://phabricator.wikimedia.org/T118704)
[04:06:33] ori: ^ adds upstart support
[04:06:37] I'll take a look at the Install now
[04:07:16] YuviPanda: just look at the unit file that ships with the jessie package
[04:09:22] RECOVERY - puppet last run on mw2002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[04:09:34] (03CR) 10Ori.livneh: redis: Add upstart support to redis::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253295 (https://phabricator.wikimedia.org/T118704) (owner: 10Yuvipanda)
[04:09:53] (03PS3) 10Yuvipanda: redis: Make redis::instance use base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/253294 (https://phabricator.wikimedia.org/T118704)
[04:09:55] (03PS2) 10Yuvipanda: redis: Add upstart support to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253295 (https://phabricator.wikimedia.org/T118704)
[04:09:55] ori: yeah done and pushed.
[04:09:58] * YuviPanda looks for comment
[04:10:33] ori: fixed
[04:10:51] ori: I can also move some of my own redises (both jessie and trusty) to test
[04:10:57] they're all hot standby so easy to test
[04:11:03] (if you want!)
[04:11:30] (03PS3) 10Yuvipanda: redis: Add upstart support to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253295 (https://phabricator.wikimedia.org/T118704)
[04:30:24] 6operations, 7Availability, 5Patch-For-Review, 7Performance: Upstart support for redis::instance - https://phabricator.wikimedia.org/T118704#1807051 (10yuvipanda)
[04:55:38] (03CR) 10Ori.livneh: [C: 032] redis: Make redis::instance use base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/253294 (https://phabricator.wikimedia.org/T118704) (owner: 10Yuvipanda)
[04:57:17] (03CR) 10Ori.livneh: [C: 031] redis: Add upstart support to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253295 (https://phabricator.wikimedia.org/T118704) (owner: 10Yuvipanda)
[04:58:36] (03CR) 10Yuvipanda: [C: 032] redis: Add upstart support to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253295 (https://phabricator.wikimedia.org/T118704) (owner: 10Yuvipanda)
[05:00:39] 7Blocked-on-Operations, 6operations, 7Availability, 5Patch-For-Review, 7Performance: Make redis/redisdb roles support multiple instances on the same servers - https://phabricator.wikimedia.org/T100714#1807060 (10yuvipanda) a:5aaron>3ori
[05:00:55] 6operations, 7Availability, 5Patch-For-Review, 7Performance: Upstart support for redis::instance - https://phabricator.wikimedia.org/T118704#1807062 (10yuvipanda) That should do it but I'll test tomorrow to make sure!
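The unit-file questions in the exchange above (add an [Install] section; whether daemonize is set, and therefore whether an "expect fork"-style service type is needed) would land in a file shaped roughly like this. This is a hedged sketch, not the actual redis::instance template from the puppet repo; the paths, port, and instance name are made up for illustration:

```ini
# Hypothetical per-instance unit, e.g. /etc/systemd/system/redis-tcp_6380.service
[Unit]
Description=Redis instance listening on port 6380
After=network.target

[Service]
# With "daemonize no" in the instance config, redis-server stays in the
# foreground, so the default Type=simple works; Type=forking (upstart's
# "expect fork") is only needed if redis daemonizes itself.
ExecStart=/usr/bin/redis-server /etc/redis/tcp_6380.conf
User=redis
Restart=on-failure

[Install]
# Without this section, "systemctl enable" has no target to hook into,
# which is why ori suggests adding it.
WantedBy=multi-user.target
```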
[06:24:42] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: puppet fail
[06:30:32] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: puppet fail
[06:30:32] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:42] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:42] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:42] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:02] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:11] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:22] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:31] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:22] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:51:02] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:21] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:42] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:43] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:52] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:42] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:58:21] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:23] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:23] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:32] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:32] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:33] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:58:41] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:19:32] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: puppet fail
[07:38:49] (03CR) 10Alexandros Kosiaris: [C: 032] Remove plutonium from infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/252968 (https://phabricator.wikimedia.org/T118586) (owner: 10Alexandros Kosiaris)
[07:38:58] (03PS2) 10Alexandros Kosiaris: Remove plutonium from infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/252968 (https://phabricator.wikimedia.org/T118586)
[07:42:35] (03CR) 10Alexandros Kosiaris: install-server: set up esams, ulsfo webproxies [puppet] - 10https://gerrit.wikimedia.org/r/126031 (owner: 10Faidon Liambotis)
[07:42:54] (03Abandoned) 10Alexandros Kosiaris: install-server: set up esams, ulsfo webproxies [puppet] - 10https://gerrit.wikimedia.org/r/126031 (owner: 10Faidon Liambotis)
[07:43:17] (03CR) 10Alexandros Kosiaris: [C: 032] Remove plutonium from infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/252968 (https://phabricator.wikimedia.org/T118586) (owner: 10Alexandros Kosiaris)
[07:44:54] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim plutonium to spares - https://phabricator.wikimedia.org/T118586#1807092 (10akosiaris)
[07:47:52] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:30] !log powered off plutonium, removed it from puppet and salt. context: https://phabricator.wikimedia.org/T118586
[08:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:13:09] (03CR) 10Alexandros Kosiaris: [C: 032] Reclaim plutonium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/252972 (https://bugzilla.wikimedia.org/118586) (owner: 10Alexandros Kosiaris)
[08:14:29] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim plutonium to spares - https://phabricator.wikimedia.org/T118586#1807118 (10akosiaris)
[08:16:31] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1807128 (10yuvipanda) p:5Low>3Normal *bump* This is blocking general access to the k8s api.
[08:23:50] (03CR) 10Jcrespo: etherpad: Add an autorestarter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253048 (owner: 10Yuvipanda)
[08:25:28] (03PS2) 10Giuseppe Lavagetto: instrumentation: add alerting endpoint [debs/pybal] - 10https://gerrit.wikimedia.org/r/252214
[08:26:22] (03CR) 10jenkins-bot: [V: 04-1] instrumentation: add alerting endpoint [debs/pybal] - 10https://gerrit.wikimedia.org/r/252214 (owner: 10Giuseppe Lavagetto)
[08:40:01] (03PS1) 10Jcrespo: Repool db1015, Depool db1027 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253298
[08:42:19] (03CR) 10Muehlenhoff: [C: 04-1] "Let's hold this back until we've enabled all systems with ferm, the advantage of the current solution is that it gets immediately applie" [puppet] - 10https://gerrit.wikimedia.org/r/253056 (owner: 10Dzahn)
[08:43:41] (03PS2) 10Jcrespo: Repool db1015, Depool db1027 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253298
[08:56:35] <_joe_> !log rsyncing files from terbium to rutherfordium
[08:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:02:00] (03PS1) 10Jcrespo: Adding ferm (and performance schema) to db1015 before repool [puppet] - 10https://gerrit.wikimedia.org/r/253303
[09:03:33] (03CR) 10Jcrespo: "I am blocked by this in order to deploy https://gerrit.wikimedia.org/r/#/c/253298/" [puppet] - 10https://gerrit.wikimedia.org/r/253303 (owner: 10Jcrespo)
[09:12:59] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet last ran 3 days ago
[09:13:57] <_joe_> 3 days ago?
[09:16:49] !log swift eqiad-prod: add ms-be1019 / ms-be1020 / ms-be1021 weight 500
[09:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:19:01] <_joe_> !log reimaging terbium
[09:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:25:48] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/253303 (owner: 10Jcrespo)
[09:34:16] !log adding missing tables to vewikimedia https://phabricator.wikimedia.org/rMW7389d7c69035c968553fbd2eaf1cf47038c4577c
[09:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:34:44] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1807247 (10fgiunchedi) `restbase2001-b` has finished bootstrapping (times UTC) and joined the cluster (cql interface up) ``` 15:35 -icinga-wm:#wi...
[09:36:10] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2001-c instance [puppet] - 10https://gerrit.wikimedia.org/r/253306
[09:36:52] (03PS3) 10Giuseppe Lavagetto: instrumentation: add alerting endpoint [debs/pybal] - 10https://gerrit.wikimedia.org/r/252214
[09:37:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2001-c instance [puppet] - 10https://gerrit.wikimedia.org/r/253306 (owner: 10Filippo Giunchedi)
[09:39:27] !log bootstrap restbase2001-c cassandra instance
[09:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:03:38] !log performing schema change on wikishared.bounce_records (x1)
[10:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:04:47] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: Connection refused
[10:17:01] (03PS2) 10Jcrespo: Adding ferm (and performance schema) to db1015 before repool [puppet] - 10https://gerrit.wikimedia.org/r/253303
[10:17:23] (03CR) 10Jcrespo: [C: 032] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/253303 (owner: 10Jcrespo)
[10:21:49] 7Puppet, 10Deployment-Systems, 5Patch-For-Review, 3Scap3: Refactor `mediawiki::scap` to make sure Scap dependencies are not dependent on mediawiki - https://phabricator.wikimedia.org/T116606#1807309 (10Joe) @thcipriani the patch is mostly ok as is, sorry for not working on this last week, but it has been s...
[10:22:43] (03CR) 10Jcrespo: [C: 032] Repool db1015, Depool db1027 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253298 (owner: 10Jcrespo)
[10:23:55] (03CR) 10Giuseppe Lavagetto: [C: 032] instrumentation: add alerting endpoint [debs/pybal] - 10https://gerrit.wikimedia.org/r/252214 (owner: 10Giuseppe Lavagetto)
[10:24:49] (03Merged) 10jenkins-bot: instrumentation: add alerting endpoint [debs/pybal] - 10https://gerrit.wikimedia.org/r/252214 (owner: 10Giuseppe Lavagetto)
[10:24:59] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1015, depool db1027 for maintenance (duration: 00m 49s)
[10:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:31:56] (03PS1) 10Filippo Giunchedi: thumbstats: fix description [software] - 10https://gerrit.wikimedia.org/r/253310
[10:32:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] thumbstats: fix description [software] - 10https://gerrit.wikimedia.org/r/253310 (owner: 10Filippo Giunchedi)
[10:34:16] (03PS1) 10Filippo Giunchedi: swiftrepl: move to argparse and ConfigParser [software] - 10https://gerrit.wikimedia.org/r/253311
[10:44:42] 6operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1807350 (10Aklapper) p:5High>3Normal This is blocked on external parameters (see T52864#1676829) hence setting priority to normal here.
[11:00:23] 6operations, 7Database: New hardware for production core mysql cluster - https://phabricator.wikimedia.org/T106847#1807378 (10jcrespo)
[11:04:12] 6operations, 7Database: defragment db1015, db1035 and db1027 - https://phabricator.wikimedia.org/T110504#1807394 (10jcrespo) db1015 defragmented and updated. Only db1027 pending.
[11:06:37] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035 - https://phabricator.wikimedia.org/T107072#1807396 (10jcrespo) This has not happened since I upgraded and defragmented on db1035...
[11:21:08] !log defragmenting, upgrading, general maintenance to mysql at db1027. Expect lag, etc. (it is depooled)
[11:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:30:06] PROBLEM - salt-minion processes on terbium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:35:14] 6operations, 7Database, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1807439 (10jcrespo) I would like to do a full-scale test of this feature on codfw to validate it and test its configuration (sadly [[ https://mariadb.com/kb/en/mariadb/replica...
[12:02:49] (03PS1) 10Jcrespo: Explicit all mariadb versions for 5.5 vs 10 [puppet] - 10https://gerrit.wikimedia.org/r/253316
[12:05:38] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping
[12:06:35] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1807472 (10faidon) 3NEW a:3Cmjohnson
[12:07:51] (03PS2) 10Jcrespo: Explicit all mariadb versions for 5.5 vs 10 [puppet] - 10https://gerrit.wikimedia.org/r/253316
[12:08:53] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/252916 (owner: 10Muehlenhoff)
[12:19:40] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1807504 (10jcrespo)
[12:19:41] 6operations, 7Database: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#1807505 (10jcrespo)
[12:19:42] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1807506 (10jcrespo)
[12:22:52] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1807514 (10jcrespo)
[12:22:53] 6operations, 7Database: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#1807515 (10jcrespo)
[12:26:37] (03PS4) 10Giuseppe Lavagetto: Move scap-specific items out of mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/252362 (https://phabricator.wikimedia.org/T116606) (owner: 10Thcipriani)
[12:30:08] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1807526 (10jcrespo) 5stalled>3Open
[12:35:56] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: puppet fail
[12:38:03] (03Abandoned) 10Faidon Liambotis: Revert "varnish: misspass limiter" [puppet] - 10https://gerrit.wikimedia.org/r/252385 (https://phabricator.wikimedia.org/T118362) (owner: 10Faidon Liambotis)
[12:45:29] (03PS2) 10Faidon Liambotis: Remove subnet for ulsfo-eqiad Giglinx link [dns] - 10https://gerrit.wikimedia.org/r/251955 (https://phabricator.wikimedia.org/T118170)
[12:45:40] (03PS5) 10Giuseppe Lavagetto: Move scap-specific items out of mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/252362 (https://phabricator.wikimedia.org/T116606) (owner: 10Thcipriani)
[12:45:44] (03CR) 10Faidon Liambotis: [C: 032] Remove subnet for ulsfo-eqiad Giglinx link [dns] - 10https://gerrit.wikimedia.org/r/251955 (https://phabricator.wikimedia.org/T118170) (owner: 10Faidon Liambotis)
[12:57:50] (03CR) 10Giuseppe Lavagetto: "Puppet compiler declares this harmless https://puppet-compiler.wmflabs.org/1292/" [puppet] - 10https://gerrit.wikimedia.org/r/252362 (https://phabricator.wikimedia.org/T116606) (owner: 10Thcipriani)
[12:59:11] (03PS1) 10BBlack: lvs100[1-6]: dhcp -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/253319
[13:00:26] <_joe_> bblack: \o/
[13:01:29] bblack: speaking of lvs... the new ones' pybal is spamming router logs again with BGP attempts :)
[13:01:33] shall I stop pybal?
[13:02:09] yeah
[13:02:14] all of them?
[13:02:46] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:52] <_joe_> I guess someone disabled/reenabled puppet globally?
[13:03:17] maybe
[13:03:22] some had puppet enabled, some don't
[13:03:38] I'll fix up the disables so they have notes again
[13:06:48] (and catch them up on misc puppet changes)
[13:10:00] (03CR) 10Giuseppe Lavagetto: [C: 032] Move scap-specific items out of mediawiki class [puppet] - 10https://gerrit.wikimedia.org/r/252362 (https://phabricator.wikimedia.org/T116606) (owner: 10Thcipriani)
[13:12:15] (03PS2) 10BBlack: lvs100[1-6]: dhcp -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/253319
[13:12:26] (03CR) 10BBlack: [C: 032 V: 032] lvs100[1-6]: dhcp -> jessie [puppet] - 10https://gerrit.wikimedia.org/r/253319 (owner: 10BBlack)
[13:19:12] !log reinstalling lvs1006
[13:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:39:54] !log dropping ipblocks_old table from all wikis
[13:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:43:35] !log reinstalling lvs100[45]
[13:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:44:43] (03PS1) 10Hashar: contint: rsync server to hold jobs caches [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017)
[13:49:13] <_joe_> bblack: I might have a new pybal package coming shortly
[13:49:35] ok cool
[13:50:48] <_joe_> bblack: to fix a config option and add the /alerts endpoint
[13:54:17] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[14:01:11] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1807607 (10jcrespo) These are the tables that are still on one of the masters and that end in old: ``` mysql -A -BN information_schema -e "SELECT CONCAT(table_schema, '.', table_name) FROM infor...
[14:19:10] godog: any chance you could take a look at that patch again and +1 / +2? ;) It's had a slight poke!
[14:19:22] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[14:22:52] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:25:23] (03PS1) 10Giuseppe Lavagetto: pybal: fix config dict types [debs/pybal] - 10https://gerrit.wikimedia.org/r/253325
[14:26:12] addshore: for sure, I'll merge shortly
[14:26:17] awesome :)
[14:27:29] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim plutonium to spares - https://phabricator.wikimedia.org/T118586#1807640 (10akosiaris)
[14:27:31] PROBLEM - puppet last run on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:27:43] PROBLEM - RAID on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:28:02] PROBLEM - Check size of conntrack table on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:28:48] ottomata: can I ignore those analytics1030 complaints?
[14:29:13] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures
[14:29:41] RECOVERY - RAID on analytics1030 is OK: OK: optimal, 13 logical, 14 physical
[14:29:52] RECOVERY - Check size of conntrack table on analytics1030 is OK: OK: nf_conntrack is 0 % full
[14:30:04] 6operations, 7Graphite, 5Patch-For-Review: icinga strips single quotes from metric names for check_graphite - https://phabricator.wikimedia.org/T118398#1807642 (10fgiunchedi)
[14:30:20] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1807646 (10jcrespo) old_growth is only on enwiki. hitcounter is on: {P2309} tables starting with click_tracking: {P2310}...
[14:30:42] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[14:30:43] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1807648 (10jcrespo)
[14:32:18] how to reset pw for commons-l admin interface? i haven't received any mail.
[14:32:18] andrewbogott: often, yes. if they are just NRPE socket timeouts that all come at once, usually, you can. that comes from the nodes being really busy and not reporting to icinga fast enough (i think).
[14:32:31] *mailinglist
[14:32:31] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1737574 (10jcrespo) Asking @LegoKTM, as he was the one archiving the tag.
[14:32:52] * Steinsplitter pokes mutante
[14:34:39] 6operations, 10vm-requests: VM request for OpenLDAP labs servers - https://phabricator.wikimedia.org/T118726#1807661 (10MoritzMuehlenhoff) 3NEW
[14:34:42] nvm. was in spam
[14:34:59] (03PS8) 10Filippo Giunchedi: Retain daily.* graphite metrics for longer (25y) [puppet] - 10https://gerrit.wikimedia.org/r/247866 (https://phabricator.wikimedia.org/T117402) (owner: 10Addshore)
[14:35:01] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:37:10] PROBLEM - puppet last run on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:37:29] PROBLEM - salt-minion processes on lvs1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:38:00] PROBLEM - RAID on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:38:11] PROBLEM - salt-minion processes on lvs1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:38:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Retain daily.* graphite metrics for longer (25y) [puppet] - 10https://gerrit.wikimedia.org/r/247866 (https://phabricator.wikimedia.org/T117402) (owner: 10Addshore)
[14:38:37] (03PS2) 10Giuseppe Lavagetto: pybal: fix config dict types [debs/pybal] - 10https://gerrit.wikimedia.org/r/253325
[14:38:50] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[14:39:00] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures
[14:39:50] RECOVERY - RAID on analytics1030 is OK: OK: optimal, 13 logical, 14 physical
[14:40:26] !log carbon daemon bounce on graphite1001
[14:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:42:08] (03PS6) 10Filippo Giunchedi: graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906
[14:42:18] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: add metric tapping [puppet] - 10https://gerrit.wikimedia.org/r/243906 (owner: 10Filippo Giunchedi)
[14:43:30] RECOVERY - salt-minion processes on terbium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:48:26] addshore: {{done}}, should be good to go
[14:48:42] epic, I'll either start using it in a bit or tomorrow!
[14:49:10] RECOVERY - salt-minion processes on lvs1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:49:21] PROBLEM - salt-minion processes on terbium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:49:36] <_joe_> something bad is happening with salt minions on newly added hosts
[14:49:39] <_joe_> apergos: ^^
[14:49:40] yeah
[14:50:00] _joe_: I just signed the terbium salt key, did that mess with you?
[14:50:02] I was somehow luckily able to fix just 1/3 of mine by doing the salt key delete -> add cycle on both palladium + neodymium
[14:50:12] it's something to do with the dual masters...
[14:50:13] <_joe_> andrewbogott: yeah you kinda did
[14:50:13] hrm
[14:50:16] sorry
[14:50:21] addshore: ok! btw you can use either carbon line protocol or statsd, though you don't need the derived metrics from statsd IIRC?
[14:50:23] <_joe_> bblack: also on neodymium?
[14:50:25] <_joe_> sigh
[14:50:36] <_joe_> andrewbogott: ask next time :)
[14:50:48] yeh, I don't need the derived stuff, and I just tested the plaintext protocol and it works going straight to graphite.eqiad.wmnet :)
[14:50:54] <_joe_> bblack: I'll do the same
[14:50:55] so will probably use that!
[14:50:56] ok! I thought I remembered that failure from last week so was just trying to clean up
[14:51:03] Since everyone always forgets to sign salt keys on new installs
[14:52:02] if it's a reinstall then there will be an issue with key rejection, if you don't delete on both masters
[14:52:23] <_joe_> apergos: it's not centralized?
[14:52:25] <_joe_> sigh
[14:52:32] <_joe_> so much for multimaster
[14:52:51] addshore: *nod* thanks!
[14:52:55] apergos: at some point you said you were going to work on that old salt ticket with the idea to use the puppet certificate store, did that ever happen?
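The carbon plaintext protocol addshore mentions testing against graphite.eqiad.wmnet is just one datapoint per newline-terminated line, `<metric.path> <value> <unix-timestamp>`, sent over TCP (port 2003 by default). A minimal sketch; the metric name is hypothetical and the send itself is left as a comment:

```shell
# Carbon plaintext line format: "<metric.path> <value> <unix-timestamp>\n"
ts=1447685000                                # normally: ts=$(date +%s)
line="daily.example.deployments 1 ${ts}"     # hypothetical metric name
echo "$line"
# To actually submit it to a carbon relay (not run here), something like:
#   printf '%s\n' "$line" | nc graphite.eqiad.wmnet 2003
```

statsd would add derived metrics (rates, percentiles) on top; since those aren't needed here, writing straight to carbon is the simpler choice.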
[14:53:24] this is a temporary arrangement until neodymium becomes primary and then only master
[14:53:25] so, delete on both, but only have to accept on one?
[14:53:56] nothing will be broken if you don't accept on neodymium
[14:54:24] paravoid: I have a patchset lying around which was waiting on the key handling in salt to be moved out to a driver
[14:54:54] someone has recently taken a stab at that; I need to look at their code and see what changes my patch needs
[14:55:10] RECOVERY - salt-minion processes on terbium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:55:31] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1807710 (10Reedy) Good work! :) chwikimedia and comcomwiki are both considered "deleted". zh_cnwiki too. zh_cnwiki the old table may still have some value; depending on when it was "deleted" dat...
[14:56:04] 6operations, 7Database: Drop database table "hashs" from Wikimedia wikis - https://phabricator.wikimedia.org/T54927#1807712 (10Reedy)
[14:58:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I have my doubts this will solve the problem, but if anything, it might help. Various inline comments." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253048 (owner: 10Yuvipanda)
[14:58:32] (03PS3) 10Andrew Bogott: adding mobrovac to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/252738 (https://phabricator.wikimedia.org/T118399) (owner: 10RobH)
[14:58:50] 6operations, 10Analytics, 10Traffic, 5Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1807718 (10Ottomata) I'd like both @JAllemandou and @leila to review this and give the go ahead to remove `ip` and `x_forwarded_for`. If they say cool,...
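The re-keying cycle being worked out above (delete the stale cached key on both masters, restart the minion so it re-submits its key, then accept on the primary only) can be sketched as a dry-run helper that prints the commands rather than running them. The minion id and the master hostnames come from the conversation, but the exact procedure is an assumption, not an official runbook:

```shell
# Dry-run sketch of the salt key cycle for a reinstalled minion under
# the temporary dual-master (palladium + neodymium) setup.
# It only PRINTS the commands; nothing is executed remotely.
rekey_minion() {
  minion="$1"
  # 1. delete the stale cached key on BOTH masters, or the minion's
  #    new key is rejected ("The Salt Master has cached the public key")
  for master in palladium neodymium; do
    echo "ssh ${master} salt-key -y -d ${minion}"
  done
  # 2. restart the minion so it re-submits its new key
  echo "ssh ${minion} service salt-minion restart"
  # 3. accept on the primary; per apergos, skipping neodymium is fine
  echo "ssh palladium salt-key -y -a ${minion}"
}

rekey_minion lvs1004.wikimedia.org
```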
[14:59:41] (03CR) 10Andrew Bogott: [C: 032] adding mobrovac to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/252738 (https://phabricator.wikimedia.org/T118399) (owner: 10RobH) [15:00:17] 10Ops-Access-Requests, 6operations: Give access to stat1002 to mobrovac - https://phabricator.wikimedia.org/T118399#1807723 (10Andrew) 5stalled>3Resolved Marco, you are now in analytics-privatedata-users. [15:05:06] 6operations, 10Analytics, 10CirrusSearch, 6Discovery: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1807738 (10Dzahn) [15:05:22] 6operations, 10Analytics, 10CirrusSearch, 6Discovery, 7audits-data-retention: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1802620 (10Dzahn) [15:05:40] RECOVERY - salt-minion processes on lvs1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:05:42] apergos: even if I stop the minion on the node, delete on both, start the minion, then accept on palladium, the client still ends up in a stuck state [15:05:49] constantly Nov 16 15:05:36 lvs1004 salt-minion[7699]: [ERROR ] The Salt Master has cached the public key for this node, this salt minion will wait for 10 seconds before attempting to re-authenticate [15:05:59] as it wasn't accepted on palladium yet [15:06:06] s/it/if it/ [15:06:20] damn [15:06:30] (it worked for me on one node, but I can't get the other two to work) [15:07:12] (03PS6) 10BBlack: add X-Client-IP to response headers for vk to consume [puppet] - 10https://gerrit.wikimedia.org/r/252427 (https://phabricator.wikimedia.org/T118557) [15:07:38] (03CR) 10BBlack: [C: 032 V: 032] add X-Client-IP to response headers for vk to consume [puppet] - 10https://gerrit.wikimedia.org/r/252427 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [15:08:13] (03PS3) 10BBlack: Add resp.http.X-Client-IP -> webrequest:client_ip 
[puppet] - 10https://gerrit.wikimedia.org/r/252928 (https://phabricator.wikimedia.org/T118557) [15:08:20] bblack: which two nodes? [15:08:39] (03CR) 10BBlack: [C: 032 V: 032] Add resp.http.X-Client-IP -> webrequest:client_ip [puppet] - 10https://gerrit.wikimedia.org/r/252928 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [15:09:33] apergos: lvs100[45].wikimedia.org [15:09:53] they're both in that stuck state now, where the minion complains, but palladium has supposedly accepted the key sent by that minion [15:10:04] ok, I'm going to have a look [15:10:08] ok [15:10:25] (03PS1) 10Giuseppe Lavagetto: terbium: start moving back cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/253329 [15:11:06] has terbium been upgraded? [15:11:53] <_joe_> Reedy: yes, what's up? [15:12:00] Nothing particularly [15:12:06] Just noticed you were moving crons back to it [15:12:15] (03PS2) 10Giuseppe Lavagetto: terbium: start moving back cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/253329 [15:12:15] So wasn't sure if that was for good or bad reasons :) [15:12:18] <_joe_> yup :) [15:12:23] <_joe_> good, good [15:13:40] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:11] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 1 failures [15:14:20] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:01] PROBLEM - Varnishkafka log producer on cp3045 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:15:02] (03CR) 10Giuseppe Lavagetto: [C: 032] terbium: start moving back cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/253329 (owner: 10Giuseppe Lavagetto) [15:15:20] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:20] PROBLEM - Varnishkafka log producer on cp4007 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:15:21] PROBLEM - Varnishkafka log producer on cp3035 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name varnishkafka [15:15:40] PROBLEM - Varnishkafka log producer on cp1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:15:40] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:40] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:50] PROBLEM - Varnishkafka log producer on cp3034 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:15:50] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:51] bleh that must be me [15:16:20] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:20] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:30] PROBLEM - Varnishkafka log producer on cp3044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:16:39] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [15:16:49] PROBLEM - Varnishkafka log producer on cp1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:16:51] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:13] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:19] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:19] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:31] PROBLEM - Varnishkafka log producer on cp1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:17:31] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:40] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:41] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:49] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: 
Puppet has 1 failures [15:17:49] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:17:49] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:01] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:01] PROBLEM - Varnishkafka log producer on cp4005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:18:11] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:20] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:21] PROBLEM - Varnishkafka log producer on cp4006 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:18:30] PROBLEM - Varnishkafka log producer on cp3047 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:18:49] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [15:18:50] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:10] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:10] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:21] I think varnishkafka has some artificial limit on format string length :/ [15:19:22] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:30] PROBLEM - Varnishkafka log producer on cp2017 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:19:31] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:40] PROBLEM - Varnishkafka log producer on cp3032 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:19:40] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:00] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:00] PROBLEM - puppet last run on cp3010 is 
CRITICAL: CRITICAL: Puppet has 1 failures [15:20:09] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:22] uhhh [15:20:30] PROBLEM - Varnishkafka log producer on cp2011 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:20:39] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:39] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:40] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:20:46] it's my patch to add client_ip to the format [15:20:49] PROBLEM - Varnishkafka log producer on cp2002 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:20:50] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:00] PROBLEM - Varnishkafka log producer on cp1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:21:00] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:00] (03PS1) 10BBlack: Revert "Add resp.http.X-Client-IP -> webrequest:client_ip" [puppet] - 10https://gerrit.wikimedia.org/r/253331 [15:21:00] PROBLEM - Varnishkafka log producer on cp4014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:21:00] PROBLEM - Varnishkafka log producer on cp3042 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:21:01] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:01] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:01] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:01] PROBLEM - Varnishkafka log producer on cp4013 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:21:02] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:02] :) [15:21:02] PROBLEM - puppet last run 
on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:07] (03PS2) 10BBlack: Revert "Add resp.http.X-Client-IP -> webrequest:client_ip" [puppet] - 10https://gerrit.wikimedia.org/r/253331 [15:21:17] (03CR) 10BBlack: [C: 032 V: 032] Revert "Add resp.http.X-Client-IP -> webrequest:client_ip" [puppet] - 10https://gerrit.wikimedia.org/r/253331 (owner: 10BBlack) [15:21:19] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:19] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:20] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:21] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:21] PROBLEM - Varnishkafka log producer on cp1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:21:21] PROBLEM - Varnishkafka log producer on cp3037 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:21:29] PROBLEM - Varnishkafka log producer on cp2014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:21:39] PROBLEM - Varnishkafka log producer on cp3033 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:21:40] o.O.... so much problems... 
[15:21:49] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:52] not really [15:21:59] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 1 failures [15:22:01] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [15:22:05] it's just a lot of spam over a problem that doesn't really impact users :P [15:22:21] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures [15:22:24] ok, better than a big problem ;) [15:22:30] PROBLEM - Varnishkafka log producer on cp1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:22:30] PROBLEM - Varnishkafka log producer on cp1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:22:50] RECOVERY - Varnishkafka log producer on cp3045 is OK: PROCS OK: 1 process with command name varnishkafka [15:22:59] PROBLEM - Varnishkafka log producer on cp2005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:23:01] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:01] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:10] PROBLEM - Varnishkafka log producer on cp1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:23:10] PROBLEM - Varnishkafka log producer on cp1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:23:10] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:11] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:12] ottomata: Nov 16 15:19:01 cp1056 varnishkafka[31133]: FMTPARSE: Failed to parse Main format string: %{fake_tag0@hostname?cp1056.eqiad.wmnet}x %{@sequence [15:23:15] !num?0}n %{%FT%T@dt}t %{Varnish:time_firstbyte@time_firstbyte!num?0.0}x %{@ip}h %{Varnish:handling@cache_status}x %{@http_status}s %{@response_ [15:23:18] size!num?0}b 
%{@http_method}m %{Host@uri_host}i %{@uri_path}U %{@uri_query}q %{Content-Type@content_type}o %{Referer@referer}i %{X-Forwarded-Fo [15:23:19] PROBLEM - Varnishkafka log producer on cp1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:23:19] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:20] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:21] PROBLEM - Varnishkafka log producer on cp1056 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:23:21] r@x_forwarded_for}i %{User-Agent@user_agent}i %{Accept-Language@accept_language}i %{X-Analytics@x_analytics}o %{Range@range}i %{X-Cache@x_cache [15:23:24] }o %{X-Client-IP@cli#012Expecting '}' after "%{X-Client-IP@cli..." [15:23:30] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:31] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:39] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:40] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:40] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:40] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:50] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [15:23:50] I think it probably has a fixed limit on format string length, so it just cuts off its input in the middle of the new X-Client-IP part and then complains about lack of }-termination [15:23:51] RECOVERY - Varnishkafka log producer on cp4005 is OK: PROCS OK: 1 process with command name varnishkafka [15:23:59] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:00] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:00] PROBLEM - puppet last run on cp1073 is 
CRITICAL: CRITICAL: Puppet has 1 failures [15:24:01] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:01] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:09] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:10] RECOVERY - Varnishkafka log producer on cp4006 is OK: PROCS OK: 1 process with command name varnishkafka [15:24:10] PROBLEM - Varnishkafka log producer on cp1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:24:11] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:19] PROBLEM - Varnishkafka log producer on cp4015 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:24:20] RECOVERY - Varnishkafka log producer on cp3047 is OK: PROCS OK: 1 process with command name varnishkafka [15:24:29] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:30] PROBLEM - Varnishkafka log producer on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:24:41] PROBLEM - Varnishkafka log producer on cp1074 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:24:41] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:59] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [15:25:09] PROBLEM - Varnishkafka log producer on cp1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:25:11] RECOVERY - Varnishkafka log producer on cp1069 is OK: PROCS OK: 1 process with command name varnishkafka [15:25:19] RECOVERY - Varnishkafka log producer on cp2017 is OK: PROCS OK: 1 process with command name varnishkafka [15:25:20] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet has 1 failures [15:25:21] ottomata: yeah, the place it's cutting off and complaining is at the 512-byte mark, so obviously it's 
some fixed input buffer size ... :P [15:25:29] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [15:25:29] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 1 failures [15:25:32] ! [15:25:35] crazy! [15:25:43] huh, bblack, just revert and we'll try on one and fix? [15:25:46] if we removed ip and xff it would make room :) [15:26:00] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:00] ottomata: I already reverted, it takes time to deploy the revert [15:26:04] ok cool [15:26:09] RECOVERY - Varnishkafka log producer on cp3044 is OK: PROCS OK: 1 process with command name varnishkafka [15:26:10] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:20] PROBLEM - Varnishkafka log producer on cp2008 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:26:20] yeah, we can't just remove those, all analytics jobs will fail since they expect those fields to be there right now [15:26:29] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:29] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:30] RECOVERY - Varnishkafka log producer on cp2002 is OK: PROCS OK: 1 process with command name varnishkafka [15:26:31] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:31] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:39] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [15:26:40] we can patch vk to not have a 512 limit too [15:26:50] PROBLEM - Varnishkafka log producer on cp3046 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:26:53] yeah [15:27:10] RECOVERY - Varnishkafka log producer on cp3037 is OK: PROCS OK: 1 process with command name varnishkafka [15:27:15] 6operations, 7Database: Drop the tables old_growth, hitcounter,
click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1807838 (10Reedy) >>! In T115982#1807646, @jcrespo wrote: > Is click_tracking_events needed? > Who can confirm this changes... [15:27:19] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [15:27:20] RECOVERY - Varnishkafka log producer on cp3032 is OK: PROCS OK: 1 process with command name varnishkafka [15:27:21] RECOVERY - Varnishkafka log producer on cp3033 is OK: PROCS OK: 1 process with command name varnishkafka [15:27:24] bblack: well this is poor. partly my bad assumption: what multimaster is supposed to do is try to connect to all masters, whine about the ones it can't, and continue on. I've seen this in operation, indeed there's no [15:27:30] ottomata: https://github.com/wikimedia/varnishkafka/blob/master/config.c#L266 [15:27:38] reason for it to behave differently, otherwise multimaster is pointless [15:27:41] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures [15:27:41] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [15:27:49] PROBLEM - Varnishkafka log producer on cp3039 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:28:04] but what it's doing at startup is refusing to process requests from one master if the key hasn't been accepted on the others [15:28:08] really bogus [15:28:08] aye wow yeah, just make that bigger 2048? 4096? 
it's a static once-allocated thing then, so might as well make it big [15:28:11] RECOVERY - Varnishkafka log producer on cp2011 is OK: PROCS OK: 1 process with command name varnishkafka [15:28:19] I'm going to see if that's a known bug or what [15:28:20] RECOVERY - Varnishkafka log producer on cp1049 is OK: PROCS OK: 1 process with command name varnishkafka [15:28:28] ottomata: yeah just bumping it to 4K is a reasonable low-risk workaround for now [15:28:31] RECOVERY - Varnishkafka log producer on cp1074 is OK: PROCS OK: 1 process with command name varnishkafka [15:28:39] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [15:28:40] RECOVERY - Varnishkafka log producer on cp1061 is OK: PROCS OK: 1 process with command name varnishkafka [15:28:40] RECOVERY - Varnishkafka log producer on cp2005 is OK: PROCS OK: 1 process with command name varnishkafka [15:28:54] RECOVERY - Varnishkafka log producer on cp3042 is OK: PROCS OK: 1 process with command name varnishkafka [15:28:54] RECOVERY - Varnishkafka log producer on cp4013 is OK: PROCS OK: 1 process with command name varnishkafka [15:28:54] RECOVERY - Varnishkafka log producer on cp4007 is OK: PROCS OK: 1 process with command name varnishkafka [15:29:00] RECOVERY - Varnishkafka log producer on cp1071 is OK: PROCS OK: 1 process with command name varnishkafka [15:29:00] RECOVERY - Varnishkafka log producer on cp1051 is OK: PROCS OK: 1 process with command name varnishkafka [15:29:00] RECOVERY - Varnishkafka log producer on cp1064 is OK: PROCS OK: 1 process with command name varnishkafka [15:29:01] RECOVERY - Varnishkafka log producer on cp1057 is OK: PROCS OK: 1 process with command name varnishkafka [15:29:03] looks good :) [15:29:09] RECOVERY - Varnishkafka log producer on cp1048 is OK: PROCS OK: 1 process with command name varnishkafka [15:29:10] RECOVERY - Varnishkafka log producer on cp1056 is OK: PROCS OK: 1 process with command name varnishkafka [15:30:10] PROBLEM - puppet last run on
cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:47] (03PS6) 10BBlack: wgHTCPRouting: use separate address for upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) [15:30:56] (03CR) 10BBlack: [C: 031] wgHTCPRouting: use separate address for upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [15:31:50] RECOVERY - Varnishkafka log producer on cp3039 is OK: PROCS OK: 1 process with command name varnishkafka [15:32:19] RECOVERY - Varnishkafka log producer on cp2008 is OK: PROCS OK: 1 process with command name varnishkafka [15:32:21] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [15:32:29] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:32:50] RECOVERY - Varnishkafka log producer on cp4014 is OK: PROCS OK: 1 process with command name varnishkafka [15:32:50] RECOVERY - Varnishkafka log producer on cp3035 is OK: PROCS OK: 1 process with command name varnishkafka [15:33:00] RECOVERY - Varnishkafka log producer on cp1072 is OK: PROCS OK: 1 process with command name varnishkafka [15:33:11] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures [15:34:42] (03PS1) 10BBlack: analytics VCL: fix missing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/253335 [15:34:56] (03CR) 10BBlack: [C: 032 V: 032] analytics VCL: fix missing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/253335 (owner: 10BBlack) [15:35:03] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1807878 (10RobH) Please note that all requested quotes have been obtained off the sub-tasks and are now in review. 
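The truncation failure bblack and ottomata diagnose above — a fixed 512-byte buffer in varnishkafka's config parser cutting the format string off mid-`%{...}` token, yielding the `Expecting '}'` error on cp1056 — can be modeled in miniature. This is a hypothetical Python sketch of the behavior, not varnishkafka's actual C code (see the `config.c` link above); the buffer size is the only detail taken from the log:

```python
CONF_LINE_MAX = 512  # mirrors the fixed buffer size being discussed

def parse_format(line):
    # Model a fixed char[512] line buffer: everything past it is silently dropped.
    buf = line[:CONF_LINE_MAX - 1]
    i = 0
    # Walk the %{...} tokens; an unterminated one means the line was cut off.
    while (i := buf.find("%{", i)) != -1:
        end = buf.find("}", i)
        if end == -1:
            raise ValueError(f"Expecting '}}' after \"{buf[i:i + 17]}...\"")
        i = end + 1

# A format string that fits parses fine; one whose last token straddles
# the 512-byte mark fails the same way as the cp1056 log line above.
```

Bumping `CONF_LINE_MAX` to 4096, as suggested in the log, is the low-risk workaround: the buffer is allocated once, so the extra size costs nothing per request.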
[15:35:20] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: Puppet has 1 failures [15:35:29] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [15:35:59] RECOVERY - Varnishkafka log producer on cp1062 is OK: PROCS OK: 1 process with command name varnishkafka [15:36:31] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures [15:36:40] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures [15:36:40] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:36:40] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet has 1 failures [15:36:49] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:37:09] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:37:09] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:37:09] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:37:11] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:37:19] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:37:19] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:37:19] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet has 1 failures [15:37:19] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:37:31] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:37:39] PROBLEM - puppet last run on cp4020 is 
CRITICAL: CRITICAL: Puppet has 1 failures [15:37:39] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 1 failures [15:37:40] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:38:00] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:38:10] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet has 1 failures [15:38:20] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:38:20] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures [15:38:29] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [15:38:40] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:38:40] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:38:40] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:38:40] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:38:49] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:39:01] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:39:11] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:11] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:19] RECOVERY - puppet last run on cp1074 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:39:30] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet 
is currently enabled, last run 1 minute ago with 0 failures [15:39:30] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:39:30] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:39:31] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:39] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:40] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:39:41] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:40:10] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:40:10] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:40:21] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:40:30] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:40:40] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:09] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:41:22] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:22] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:42:00] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:42:00] RECOVERY - puppet last run 
on cp3042 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:42:30] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:42:30] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:42:30] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:42:39] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:42:40] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:42:40] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:42:41] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:42:41] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:42:43] 7Puppet, 10Deployment-Systems, 5Patch-For-Review, 3Scap3: Refactor `mediawiki::scap` to make sure Scap dependencies are not dependent on mediawiki - https://phabricator.wikimedia.org/T116606#1807931 (10thcipriani) 5Open>3Resolved [15:42:49] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:42:50] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:43:00] RECOVERY - Varnishkafka log producer on cp3034 is OK: PROCS OK: 1 process with command name varnishkafka [15:43:00] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:43:29] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:43:40] RECOVERY - puppet last 
run on cp1057 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:43:59] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:44:23] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1807937 (10jcrespo) As a comment, I always perform a backup of all tables before deletion. I do not have, however, a perman... [15:44:29] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:45:10] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:45:18] (03CR) 10Hashar: [C: 031] wgHTCPRouting: use separate address for upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [15:45:19] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:45:30] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:46:11] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:46:19] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:46:50] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:46:50] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:47:10] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:47:19] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 
minute ago with 0 failures [15:47:21] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:47:30] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:47:41] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:47:59] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:48:00] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:48:09] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:48:10] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:48:11] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:48:20] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:48:31] RECOVERY - Varnishkafka log producer on cp2014 is OK: PROCS OK: 1 process with command name varnishkafka [15:48:41] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:48:41] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:48:49] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:48:49] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:49:10] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:19] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is 
currently enabled, last run 46 seconds ago with 0 failures [15:49:30] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:30] RECOVERY - Varnishkafka log producer on cp4015 is OK: PROCS OK: 1 process with command name varnishkafka [15:49:30] RECOVERY - Varnishkafka log producer on cp1073 is OK: PROCS OK: 1 process with command name varnishkafka [15:49:31] RECOVERY - Varnishkafka log producer on cp1099 is OK: PROCS OK: 1 process with command name varnishkafka [15:49:41] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:41] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:59] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:10] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:50:20] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:50:30] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:50:41] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:50:49] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:50:58] 6operations, 10netops: Figure out the source of QSFP+ errors with DAC + MX480 - https://phabricator.wikimedia.org/T118259#1807968 (10faidon) 5Open>3Resolved Juniper confirmed that DACs are not compatible with the MX series. Nothing to do here but replace DACs with optics. 
[15:50:59] 6operations, 10ops-esams, 10netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#1807970 (10faidon) [15:51:00] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:09] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:30] RECOVERY - Varnishkafka log producer on cp1043 is OK: PROCS OK: 1 process with command name varnishkafka [15:51:40] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:41] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:50] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:50] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:51:50] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:01] RECOVERY - Varnishkafka log producer on cp3046 is OK: PROCS OK: 1 process with command name varnishkafka [15:52:39] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:52:54] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1807973 (10Addshore) 3NEW [15:52:59] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:52:59] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:53:30] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:39] RECOVERY - puppet last run on 
cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:53:50] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:53:50] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:54:00] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1807982 (10Addshore) [15:54:29] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:07] (03PS3) 10Thcipriani: RESTBase configuration for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/252887 [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151116T1600). Please do the needful. [16:01:15] (03PS1) 10Giuseppe Lavagetto: terbium: move back all cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/253340 [16:02:33] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1808025 (10Addshore) [16:02:36] I can SWAT this morning: bblack bmansurov ping for SWAT. 
[16:03:24] I'm here [16:03:27] me too [16:03:34] (03PS2) 10Giuseppe Lavagetto: terbium: move back all cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/253340 [16:03:40] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:04:05] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] terbium: move back all cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/253340 (owner: 10Giuseppe Lavagetto) [16:04:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253038 (https://phabricator.wikimedia.org/T118525) (owner: 10Jhobs) [16:04:55] (03Merged) 10jenkins-bot: Disable first QuickSurveys survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253038 (https://phabricator.wikimedia.org/T118525) (owner: 10Jhobs) [16:05:51] 6operations, 10hardware-requests, 7Performance: Refresh Parser cache servers pc1001-pc1003 - https://phabricator.wikimedia.org/T111777#1808035 (10jcrespo) [16:06:33] <_joe_> thcipriani: I have a previously scheduled meeting at the time of the deployment meeting today :/ [16:07:03] <_joe_> thcipriani: if I could choose, I'd come to your meeting, FWIW [16:07:07] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Disable first QuickSurveys survey [[gerrit:253038]] (duration: 00m 26s) [16:07:11] ^ bmansurov check please [16:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:07:26] thcipriani: thanks, checking [16:07:36] 6operations, 10Beta-Cluster-Infrastructure, 7Blocked-on-RelEng, 7HHVM, 5Patch-For-Review: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1808045 (10Joe) Terbium is now done; I'll look at reimaging tin next. [16:08:15] _joe_: np, we've got plenty on our agenda today before talking about server pooling, it'll be good for us to prep as a group. Can you make it next week? 
[16:10:24] thcipriani: checked, works as expected. thanks again! [16:10:32] bmansurov: thank you for checking! [16:11:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [16:11:31] (03Merged) 10jenkins-bot: wgHTCPRouting: use separate address for upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/249121 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [16:12:00] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1808069 (10ArielGlenn) Where we are: First off, Brandon added neodymium to the salt master exception in the router configs so that neodymium can communicate with all hosts, e.g. analytics*. S... [16:13:34] bblack: ok, ready to sync your change. Any puppet changes need to happen before I pull the trigger, or is that all done? [16:13:50] thcipriani: it's all done, and I have sniffers running waiting to verify the change works :) [16:13:55] kk [16:14:28] !log thcipriani@tin Synchronized wmf-config/squid.php: SWAT: wgHTCPRouting: use separate address for upload [[gerrit:249121]] (duration: 00m 26s) [16:14:31] ^ bblack check please [16:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:15:49] 6operations, 7Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1808085 (10Reedy) >>! In T115982#1807937, @jcrespo wrote: > As a comment, I always perform a backup of all tables before de... [16:16:12] (03PS1) 10Alexandros Kosiaris: salt: Move the role manifests into role module [puppet] - 10https://gerrit.wikimedia.org/r/253342 [16:16:55] thcipriani: I don't think anything's broken by the change, but I'm not seeing all the traffic switch IPs. It looks like only a small fraction has so far. 
[16:17:25] it may be that that configuration data is cached in some way that it doesn't take effect until a code restart or something like that [16:17:51] in any case, the receiving side is listening on both the old and new IPs, so it's not a huge deal if it takes a while to take effect [16:18:54] bblack: hmm, I thought scap had some configuration file cache-busting. [16:19:30] oh, but probably not for that particular file, right. I think it's just for InitialiseSettings.php [16:19:53] (03CR) 10Faidon Liambotis: "Is this obsolete in favor of https://gerrit.wikimedia.org/r/#/c/252681/ ?" [puppet] - 10https://gerrit.wikimedia.org/r/243661 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [16:20:09] thcipriani: oh actually, I wasn't thinking correctly as I looked at the traffic I was sampling [16:20:17] the problem may all be in my head, let me confirm a few more things :) [16:20:49] (03CR) 10Faidon Liambotis: [C: 032] interface: some lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249344 (owner: 10Dzahn) [16:22:27] thcipriani: ok everything's fine and confirmed, the problem was all in my head [16:22:41] bblack: awesome. Thanks for checking! [16:25:33] (03PS4) 10Thcipriani: RESTBase configuration for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/252887 [16:29:16] 6operations, 10Phabricator-Bot-Requests, 10procurement, 5Patch-For-Review: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1808121 (10RobH) 5Resolved>3Open It seems that some items still bounce from both the vendors and myself. I attempted to forward into anot... [16:29:30] 6operations, 10Phabricator-Bot-Requests, 10procurement, 5Patch-For-Review: update emailbot to allow cc: for #procurement - https://phabricator.wikimedia.org/T117113#1808126 (10RobH) Chase: I've reopened with the above bounce info. 
[16:34:54] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1808129 (10RobH) [16:36:38] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1808133 (10Cmjohnson) p:5Normal>3High [16:37:10] (03PS3) 10BBlack: upload purging: do not listen on text/mobile addr [puppet] - 10https://gerrit.wikimedia.org/r/249128 (https://phabricator.wikimedia.org/T116752) [16:37:24] (03CR) 10BBlack: [C: 032 V: 032] upload purging: do not listen on text/mobile addr [puppet] - 10https://gerrit.wikimedia.org/r/249128 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [16:37:28] Hello, is there a sysadmin who can help? [16:37:35] what's up? [16:37:44] I got a problem with restoring a page at dewiki. Restoring of other pages works [16:37:54] What's the problem/error? [16:38:00] You also probably don't want ops for this [16:38:14] [8cacb543] 2015-11-16 16:23:25: Fatal exception of type MWException [16:38:20] the word "sysadmin" has so many meanings :) [16:38:28] 6operations, 7Database: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#1808135 (10Reedy) [16:38:29] Luke081515: File a ticket, please [16:38:32] ok [16:38:41] Yeah, and we'll attach a proper stack trace [16:38:44] * Reedy looks [16:39:00] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1808140 (10RobH) Just FYI this is now covered under support contract C1-4852090500. (I'm still working on T116051 which will include updating racktables info for all networking devices.) [16:39:58] 6operations, 7Database: Drop PovWatch extension-related database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54924#551102 (10Reedy) This is good to go.
No data on enwiki or commonswiki, and just minimal on testwiki [16:40:43] hoo: contentmodel related [16:41:13] Luke081515: let me know when you've raised a ticket, I've got the stack trace [16:41:34] ok [16:42:43] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1808155 (10RobH) Additionally, we will likely need to add in a non admin rule, rather than grant you full admin when all you need is to read data, correct? As such, you would need... [16:44:08] (03PS3) 10BBlack: purging: do not VCL-filter on domain regex [puppet] - 10https://gerrit.wikimedia.org/r/249129 (https://phabricator.wikimedia.org/T116752) [16:44:19] gwicke: sorry if the above task https://phabricator.wikimedia.org/T117473 won't have anything to do with you, but I wanted to check with you since you are an admin on the service. nuria is requesting access to the restbase/cassandra nodes for pageview data [16:44:29] and I'm trying to find out if that data is stored and accessible from another group [16:44:37] or if the admins role is the only current group with access [16:44:54] 6operations: Goal: Strengthen Incident monitoring infrastructure - https://phabricator.wikimedia.org/T118746#1808163 (10akosiaris) 3NEW a:3akosiaris [16:45:01] mobrovac: you may also know =] ^ [16:45:27] 6operations, 7Monitoring: Better abstractions for puppet & icinga/nagios/shinken - https://phabricator.wikimedia.org/T85624#1808173 (10akosiaris) [16:45:28] 6operations: Goal: Strengthen Incident monitoring infrastructure - https://phabricator.wikimedia.org/T118746#1808172 (10akosiaris) [16:45:36] I want to get them access to said data without giving them permissions to accidentally break things =] [16:48:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [5000000.0] [16:48:37] robh: yeah, there doesn't seem to be a "normal" group there [16:49:01] (03CR) 10BBlack: [C: 032 V: 
032] purging: do not VCL-filter on domain regex [puppet] - 10https://gerrit.wikimedia.org/r/249129 (https://phabricator.wikimedia.org/T116752) (owner: 10BBlack) [16:49:04] (03PS1) 10Muehlenhoff: WIP: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 [16:49:08] robh: restbase logs are sent to logstash, so read access would basically mean being able to read /var/lib/cassandra [16:49:17] wrt logs [16:49:53] 6operations, 10Traffic: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#1808189 (10BBlack) [16:49:53] (03CR) 10jenkins-bot: [V: 04-1] WIP: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 (owner: 10Muehlenhoff) [16:50:12] 6operations: Undeletion of "Add-on" at dewiki fails - https://phabricator.wikimedia.org/T118747#1808193 (10Luke081515) 3NEW [16:50:16] mobrovac: but the logstash data doesn't contain the pageview data? [16:50:23] Reedy: Done [16:50:27] (03PS1) 10Muehlenhoff: Some further finetuning to server groups [puppet] - 10https://gerrit.wikimedia.org/r/253348 [16:50:52] 6operations: Undeletion of "Add-on" at dewiki fails - https://phabricator.wikimedia.org/T118747#1808200 (10Luke081515) p:5Triage>3Unbreak! [16:50:56] mobrovac: if not, it seems ideal to start pushing that somewhere or setup a new group to read them. is that the /var/lib/cassandra directory info? 
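For reference, per the exchange that follows, the pageview data itself sits in Cassandra and is read interactively through a CQL shell once a user can log onto a node. A hypothetical session might look like the following; the host name, credentials, keyspace and table are all illustrative placeholders, not the production values:

```
# Hypothetical: host, credentials and keyspace/table names are
# illustrative, not the real AQS/RESTBase values.
$ cqlsh -u cassandra -p REDACTED restbase1001.eqiad.wmnet

cqlsh> DESCRIBE KEYSPACES;
cqlsh> SELECT * FROM some_keyspace.pageviews_table LIMIT 10;
```

As noted below, cqlsh needs a Cassandra user/password of its own; there is no per-shell-login Cassandra account.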
[16:51:01] 6operations: Undeletion of "Add-on" at dewiki fails - https://phabricator.wikimedia.org/T118747#1808201 (10Reedy) [16:51:07] robh: the data itself is available in cassandra, if a user can connect to the nodes, then they can start a cql shell as well and look at the pageviews data [16:51:31] robh: sorry, i meant /var/log/cassandra/ for cass logs [16:52:01] robh: but the data itself, they just need to be able to log onto a node and start cqlsh [16:52:07] so they have to connect to the cassandra nodes and be able to run the cql shell OR connect to the cassandra nodes and read the info in /var/log/cassandra [16:52:08] oh [16:52:27] so 'ALL = (cassandra) NOPASSWD: ALL', just has to be limited down for this usecase [16:52:39] i think thats more reasonable than letting non admins have full admin access, right? [16:52:40] 6operations, 7Database: Drop database table "email_capture" from Wikimedia wikis - https://phabricator.wikimedia.org/T57676#1808205 (10Reedy) Seems likely to exist anywhere that AFT would've existed... Minimal rows on testwiki. 333052 rows on enwiki. Looks to essentially be a lot (including contents) of email... [16:53:04] isnt running that shell as cassandra going to let them do possible breaking things? [16:53:17] or can they run it as themselves? [16:53:30] (sorry for the questions! ;) [16:53:55] You never said they ran it as cassandra, I read that into it, likely incorrectly. [16:54:21] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [16:55:40] I see you have to provide a user and pass though for that shell [16:55:57] so not sure how that is defined/deployed [16:56:28] (i see adding users, but not sure how you guys process that) [16:56:37] would I just create a task and assign to your team? 
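The blanket rule quoted above ('ALL = (cassandra) NOPASSWD: ALL') is what "limited down" would apply to. A sketch of a narrowed policy, in plain sudoers syntax, could look like the following; the group names and the whitelisted command are hypothetical, and in practice these rules live in the puppet admin module's data rather than a hand-edited sudoers file:

```
# Hypothetical sketch only: group names and command paths are
# illustrative. Admins keep full access as the cassandra user:
%aqs-admin ALL = (cassandra) NOPASSWD: ALL

# Read-only users get at most a single whitelisted command:
%aqs       ALL = (cassandra) NOPASSWD: /usr/bin/nodetool status
```

Reading /var/log/cassandra may need no sudo rule at all if the log files are made group-readable for the read-only group.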
[16:57:32] robh: ideally, two groups are needed: aqs and aqs-admin, where the former only needs to be able connect and read /var/log/cassandra, the latter should stay as-is [16:58:09] (03PS1) 10Zfilipin: RuboCop: fixed Style/TrailingBlankLines offense [puppet] - 10https://gerrit.wikimedia.org/r/253349 (https://phabricator.wikimedia.org/T112651) [16:58:10] cool, that seems right to me as well [16:58:10] (03PS1) 10Zfilipin: RuboCop: fixed Style/Tab offense [puppet] - 10https://gerrit.wikimedia.org/r/253350 (https://phabricator.wikimedia.org/T112651) [16:58:13] (03PS1) 10Zfilipin: RuboCop: fixed Style/StringLiterals offense [puppet] - 10https://gerrit.wikimedia.org/r/253351 (https://phabricator.wikimedia.org/T112651) [16:58:19] so then when i finish that, you guys still have to make a cassandra user? [16:58:32] rephrase: after that is done, what else is needed for them to read the data in the shell? [16:59:14] robh: nothing, they need to know the cassandra user/pass in order to connect to it via a CQL shell [16:59:33] robh: there's no per-login-user cass user [16:59:38] Ok, and they should interface with you guys for that right? [16:59:50] services is the gatekeeper of that user/pass (i mean) [17:00:14] robh: yup, but ops are too [17:01:52] Where exactly is that information stored? [17:01:56] (I don't know it ;) [17:11:03] robh: ops/private iirc [17:11:13] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1808246 (10RobH) Update from IRC: Ideally, two groups are needed: aqs-user and aqs-admin, where the former only needs to be able connect and read /var/log/cassandra, the latter sh... [17:19:55] cool, i'll look for it post meeting, ive updated ticket [17:20:21] mobrovac: thank you! 
you clarified that nicely, I think I understand the next steps now (make a new aqs-user group) [17:23:22] cool, np robh [17:25:12] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1808301 (10hashar) Scheduled for Tuesday 17th November 15:00–16:00 UTC # 07:00–08:00 PST 16:00–17:00 UTC+1 [17:28:44] (03PS4) 10Yuvipanda: etherpad: Add an autorestarter [puppet] - 10https://gerrit.wikimedia.org/r/253048 [17:28:53] akosiaris: ^updated [17:30:28] (03CR) 10Ottomata: [C: 031] Uninstall wpasupplicant [puppet] - 10https://gerrit.wikimedia.org/r/252916 (owner: 10Muehlenhoff) [17:31:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [17:31:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [17:33:44] (03CR) 10CSteipp: "With https://gerrit.wikimedia.org/r/#/c/223702/ this is possible, and we should merge it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [17:39:34] (03CR) 10Reedy: [C: 031] "I presume this can actually go without the dependant patch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [17:41:19] 6operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1808385 (10dr0ptp4kt) @aklapper, thanks for reminder. Yes, the feedback was suggestive of there not being a means of modifying the system's behavior. It was suggested people need to concoct their own custom bulk ad... 
[17:43:10] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [5000000.0] [17:44:29] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:46:18] (03PS2) 1020after4: scap: Create wrapper script for master-master rsync [puppet] - 10https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: 10BryanDavis) [17:46:37] (03PS3) 10Alexandros Kosiaris: Allow same perms to tileratorui as tilerator [puppet] - 10https://gerrit.wikimedia.org/r/249501 (https://phabricator.wikimedia.org/T112914) (owner: 10Yurik) [17:46:50] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:16] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1808412 (10ArielGlenn) The logs are on dataset1001, but we should really be copying them off somewhere else like all apache logs. Do you... [17:47:36] ^ etherpad is indeed dead. 
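The autorestarter change being iterated on above (Gerrit 253048) comes up again below in terms of monit. A minimal monit stanza for this kind of HTTP-liveness restart might look like the following; the service name, pidfile path, port and cycle threshold are assumptions for illustration, not the contents of the actual patch:

```
# Hypothetical monit check; service name, pidfile, port and thresholds
# are assumptions, not taken from the Gerrit change under review.
check process etherpad-lite with pidfile /var/run/etherpad-lite/etherpad-lite.pid
  start program = "/usr/sbin/service etherpad-lite start"
  stop program  = "/usr/sbin/service etherpad-lite stop"
  if failed port 9001 protocol http for 3 cycles then restart
```

The protocol http test, rather than a bare process check, matters here: a nodejs process that is alive but stalled would still fail the HTTP probe, which is the failure mode raised in the discussion below.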
[17:47:42] lol [17:47:53] :( :( [17:48:11] seems pretty standard [17:48:23] yuvi is trying to make it autorestart [17:48:27] or is, at least, working on it [17:48:40] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.006 second response time [17:48:50] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [17:49:08] <_joe_> Reedy: etherpad is a piece of crap and restarting it would not heal it [17:49:18] lol [17:49:32] (03CR) 10Alexandros Kosiaris: [C: 032] "Approved during meeting" [puppet] - 10https://gerrit.wikimedia.org/r/249501 (https://phabricator.wikimedia.org/T112914) (owner: 10Yurik) [17:49:35] <_joe_> it would still heal some pain points for people [17:50:21] Reedy: yeah I just modified that patch to match akosiaris' comments [17:50:29] should maybe merge that again [17:51:03] I just restarted the crap service [17:51:04] (03PS5) 10Yuvipanda: etherpad: Add an autorestarter [puppet] - 10https://gerrit.wikimedia.org/r/253048 [17:51:18] akosiaris: ^ we should autorestart :D [17:51:37] <_joe_> PlasmaFury: if you want to make it really shitty [17:51:50] <_joe_> make the nrpe check restart the damn thing when it fails [17:51:51] <_joe_> :P [17:51:54] no :P [17:52:00] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:52:02] PlasmaFury: yay... it's not going to fix anything aside from us not being pinged about it [17:52:12] Out of sight, out of mind [17:52:16] lol [17:52:20] akosiaris: that's enough fixing for us without us doing stuff upstream no? [17:52:38] good question.... 
I remember us saying it's a best effort service [17:52:41] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [17:52:45] I am not sure when it stopped being one [17:52:46] 6operations, 5Patch-For-Review: Fix all .erb variable warnings - https://phabricator.wikimedia.org/T97251#1808454 (10Andrew) a:5Andrew>3None [17:52:53] not that I did not anticipate that [17:53:06] akosiaris: yeah my understanding is that we won't actually have time to investigate scaling it 'properly' or if it even scales properly [17:53:12] it used to crash on its own frequently enough [17:53:19] so that we would not have that problem [17:53:30] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:53:52] tbh, a cron restarting unconditionally once per day might make more sense [17:54:03] but let's evaluate the monit approach first since it is better [17:54:13] akosiaris: I use it for magnus' wdq [17:54:18] and it is quite nice [17:54:20] lol [17:54:25] it leaks memory and then dies [17:54:33] oh I know monit enough, I was not saying that [17:54:45] I am not sure it is easy to detect the failure [17:54:48] that is what I am saying [17:55:06] akosiaris: http failure is at least a high bar - if it's http failing it's indeed dead no [17:55:19] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: CRITICAL: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State [17:55:22] what happens is that it stalls (being a nodejs thing) answering and then suddenly it wakes up [17:55:37] PlasmaFury: yeah, I am hoping [17:55:46] akosiaris: the alternative is to write an irc bot that tails wikimedia-operations and greps for 'etherpad is down' [17:56:08] ori: yay!!! 
human powered nagios [17:56:23] wikimedia monitoring: OH MY GOD, IT'S MADE OF PEOPLE [17:56:59] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1808482 (10Addshore) I have access to fluorine which contains mediawiki logs and udp2log. Also the stat / analytics cluster. Copying to... [17:57:23] ori: actually it makes sense [17:57:32] volunteers do pretty much everything else [17:57:39] why automate monitoring ? we got people!!! :P [17:57:54] lets start a new movement [17:58:02] #peoplemonitoring ! [17:58:18] when #devops doesn't work, rely on end users [17:58:40] crowdsourced service monitoring [17:58:41] :-D [17:58:43] big brother is watching... [17:58:58] sounds like a venture capital worthy startup [17:59:01] Usually, people are quicker than nagios [17:59:03] we have a service called 'bigbrother' in labs [17:59:05] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim plutonium to spares - https://phabricator.wikimedia.org/T118586#1808491 (10RobH) a:3Cmjohnson Re-tasking to Chris per Ops meeting update from Alex. [17:59:06] https://en.wikipedia.org/wiki/Big_Brother_(software) ? :) [17:59:11] I've been trying to kill it for a while [17:59:16] akosiaris: btw we're going to do this with perf [17:59:18] it's a perl script with a bunch of bugs :( [17:59:19] not even joking [17:59:29] "it's slow" [17:59:34] phedenskog is working on having an on-wiki perf dashboard you can use to identify issues and report them [18:00:12] akosiaris, thx, i think we should put all these admin access settings in a puppet role "service" [18:00:33] yurik: ? [18:00:43] akosiaris, re sudo [18:00:48] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1808495 (10ArielGlenn) Well, these should end up on fluorine like everything else. 
Let me look into how that works (or someone who knows... [18:01:14] yurik: yeah I got that part, the puppet role "service" thing has me confused [18:02:29] lol [18:02:48] Reedy: hashar left some work for us ^ [18:03:04] False positive [18:03:22] let's write a bot that monitors hashar's quit messages and replays them back to him amplified 10 times on rejoin [18:03:50] We could send them to him on memoserv, so he gets emails too [18:03:52] (03CR) 10ArielGlenn: [C: 031] Uninstall wpasupplicant [puppet] - 10https://gerrit.wikimedia.org/r/252916 (owner: 10Muehlenhoff) [18:04:16] 6operations, 6Labs: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#1808509 (10yuvipanda) 3NEW a:3yuvipanda [18:04:16] Reedy: you are a sneaky one. I like that ;-) [18:04:29] _joe_: https://integration.wikimedia.org/ci/view/Ops/job/operations-puppet-catalog-compiler/1294/console [18:04:35] puppet-compiler bug I assume ? [18:04:40] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1808517 (10brion) I have no record of rel13testwiki or devwikiinternal; would assume they're junk but it might be worth double-checking their content before trashing them. [18:04:57] it was running for 50mins due to that None error [18:08:58] <_joe_> akosiaris: already known, if you don't specify a node it doesn't work [18:09:42] _joe_: yeah, I want to fix it ... [18:09:55] got a rather big change for role::salt... 
need to make it default to ALL [18:10:13] <_joe_> akosiaris: yeah the code should be fixable by you if you have the time [18:10:38] <_joe_> akosiaris: https://phabricator.wikimedia.org/T114305 [18:12:21] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1808563 (10Reedy) ``` mysql:wikiadmin@db1044 [rel13testwiki]> show tables; +-------------------------+ | Tables_in_rel13testwiki | +-------------------------+ | archive | | blobs... [18:12:44] _joe_: ok thanks! [18:14:00] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.038 second response time on port 9042 [18:15:38] 6operations, 10vm-requests: VM request for OpenLDAP labs servers - https://phabricator.wikimedia.org/T118726#1808580 (10akosiaris) I just picked names for those 2 - eqiad: seaborgium.wikimedia.org - codfw: serpens.wikimedia.org [18:17:35] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1808588 (10Addshore) They could be added here https://github.com/wikimedia/operations-puppet/blob/production/modules/statistics/manifest... [18:17:59] <_joe_> akosiaris: there is some smart code that was selecting nodes and did not work in the new version and I didn't bother fixing it [18:18:26] <_joe_> (re: puppet compiler) [18:18:49] yeah, I am gonna go for the simple globs for now [18:18:56] dumb, not smart [18:20:59] moritzm: andrewbogott akosiaris: chasemp uh, I feel slightly wary of putting labs VM support stuff on ganeti VMs, just that there are now two VM systems interacting makes me feel straaaange. [18:21:18] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1808596 (10brion) Yeah those sound safe to remove! [18:21:29] PlasmaFury: you talking about ldap? 
[18:21:34] andrewbogott: yeah [18:21:40] that really shouldn't be an issue in that respect [18:21:47] hopefully nothing in labs knows it's a vm on ganeti, etc [18:21:48] :) [18:21:51] heh, if ldap is going to stop being my problem then I don’t care where it is :) [18:22:46] hmmm [18:23:11] PlasmaFury: if we were going to run labs support ldap /on labs/ I would be a bit more worried [18:23:19] I can understand the hesitation but I can't come up with any specific reason to shy away from it [18:23:26] chasemp: yeah me neither [18:23:30] this is just 'vaggueee feeling' [18:30:59] PlasmaFury: there should be no problem [18:31:29] why do you care if it is a VM or a hardware box in an unrelated infrastructure to labs ? [18:31:52] 6operations: ganglia eqiad misc hosts shows various openstack vms - https://phabricator.wikimedia.org/T118690#1808639 (10Dzahn) see T115330 [18:32:34] 6operations: ganglia eqiad misc hosts shows various openstack vms - https://phabricator.wikimedia.org/T118690#1808641 (10Dzahn) [18:32:35] 6operations, 10netops: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#1721704 (10Dzahn) [18:33:19] akosiaris: as I said, 'vague feelings of complexity' that I can't really back up with anything real [18:33:23] so let's ignore them I guess [18:33:27] 6operations, 7Icinga: make critical icinga services always send email but keep honoring timezones for pages - https://phabricator.wikimedia.org/T114661#1808644 (10Dzahn) p:5Normal>3High [18:35:07] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1808649 (10ArielGlenn) I'd rather put 'em on fluorine like the mw apache logs.
[18:37:55] (03PS6) 10Yuvipanda: etherpad: Add an autorestarter [puppet] - 10https://gerrit.wikimedia.org/r/253048 [18:37:57] (03PS1) 10Yuvipanda: dynamicproxy: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253360 [18:38:09] ori: ^ one migration of redis, I'll merge and test after you take a look [18:41:00] (03CR) 10Ori.livneh: [C: 031] dynamicproxy: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253360 (owner: 10Yuvipanda) [18:41:52] PlasmaFury: ok then [18:43:44] (03PS3) 10CSteipp: Log privileged users with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) [18:46:04] thanks ori [18:47:54] (03CR) 10CSteipp: "Updated this patch to very specifically only measure a count of number of logins that would be affected by T104370-T104373 (so it logs bas" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [18:48:42] akosiaris: mind if I merge the etherpad restarter? [18:48:55] * PlasmaFury is too lazy to git rebase it out of the dynamicproxy patch but can do if there are objections [18:55:25] (03CR) 10Yuvipanda: [C: 032] "Let's see what happens" [puppet] - 10https://gerrit.wikimedia.org/r/253048 (owner: 10Yuvipanda) [18:55:27] * PlasmaFury does anyway [18:56:25] akosiaris: wait why do we have apache in front of nodejs?! [18:56:30] that seems strange [18:57:26] ok now to, uh, test it... 
[19:01:23] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1808736 (10Reedy) I'll put them in a new request per Jaimes request [19:01:58] yay [19:02:00] it worked [19:02:02] ok [19:02:22] I'm going to get some breakfast and then do the proxy stuff [19:14:47] (03CR) 10Yuvipanda: [C: 032] dynamicproxy: Move to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253360 (owner: 10Yuvipanda) [19:16:45] 6operations, 10Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1808803 (10mmodell) [19:17:48] (03PS1) 10Yuvipanda: dynamicproxy: Don't try to restart redis when config changes [puppet] - 10https://gerrit.wikimedia.org/r/253364 [19:18:00] (03PS1) 10Dzahn: salt: add motd comment about keys on 2 servers [puppet] - 10https://gerrit.wikimedia.org/r/253365 [19:18:50] (03CR) 10Yuvipanda: [C: 032] dynamicproxy: Don't try to restart redis when config changes [puppet] - 10https://gerrit.wikimedia.org/r/253364 (owner: 10Yuvipanda) [19:20:43] (03PS2) 10Dzahn: salt: add motd comment about keys on 2 servers [puppet] - 10https://gerrit.wikimedia.org/r/253365 [19:20:57] (03CR) 10Dzahn: [C: 032] salt: add motd comment about keys on 2 servers [puppet] - 10https://gerrit.wikimedia.org/r/253365 (owner: 10Dzahn) [19:21:35] PlasmaFury: \o/ [19:21:36] thanks! [19:23:16] !log Deployed patch for T116095 [19:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious [19:26:01] (03PS1) 10Andrew Bogott: Set minimum password length of 8 for wikitech. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) [19:28:08] ori: so there's a lot of things that the redis::legacy module did that we've to do outside now [19:28:12] like vm overcommit [19:28:19] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [19:28:26] (03CR) 10Dzahn: "can we easily find out how many users there are with shorter passwords than that? if there are just a few i'd say let's also ban the exist" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [19:28:38] PlasmaFury: that belongs outside imo [19:28:43] (03PS1) 10Yuvipanda: dynamicproxy: Specify port number for replication [puppet] - 10https://gerrit.wikimedia.org/r/253368 [19:28:49] ori: yeah I was thinking about that, and that's right [19:28:50] mutante: csteipp has a logging patch waiting to go in... ie priveleged users that have short passwords [19:29:06] ori: like for dynamicproxy, the maxmemory is way lower than the memory on the machine and so overcommit makes no sense [19:29:09] mutante: https://gerrit.wikimedia.org/r/#/c/222025/2 [19:29:23] PlasmaFury: yep [19:29:39] (03PS2) 10Andrew Bogott: Set minimum password length of 8 for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) [19:29:48] (03PS2) 10Yuvipanda: dynamicproxy: Specify port number for replication [puppet] - 10https://gerrit.wikimedia.org/r/253368 [19:29:54] Reedy: yes, thanks! this one would be just for wikitech though, so way smaller user base [19:30:05] Reedy: Nice... together with other logs, you can see who has the short passwords [19:30:19] timing [19:30:20] mutante: works as a PoC etc... 
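The overcommit point above can be made concrete with a config sketch; the values are illustrative, not the production ones:

```
# Host-wide setting formerly applied by the redis::legacy class, now
# done outside redis::instance (e.g. via sysctl). It lets redis fork()
# for background saves even when the child's copy-on-write pages could
# nominally exceed free memory:
#
#   vm.overcommit_memory = 1
#
# Per-instance redis.conf: when maxmemory is far below host RAM (the
# dynamicproxy case discussed here), the overcommit knob buys little:
#
#   maxmemory 512mb
```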
Copy, paste, amend abuse [19:31:04] (03CR) 10Yuvipanda: [C: 032] dynamicproxy: Specify port number for replication [puppet] - 10https://gerrit.wikimedia.org/r/253368 (owner: 10Yuvipanda) [19:31:21] (03CR) 10Dzahn: "this would be interesting to have for wikitech: https://gerrit.wikimedia.org/r/#/c/222025/2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [19:31:39] Reedy: *nod* looks like it, yea [19:32:51] (03CR) 10Reedy: [C: 04-1] Log privileged users with short passwords (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [19:33:04] csteipp: You need a global $wmgUseCA [19:34:38] so separately from this patch.. ..we put everything in CommonSettings.php even if it's not common. we add "if ( $wgDBname =" sections instead. [19:34:50] it kind of seems against the idea of "common" but shrug [19:35:03] SometimesCommonSettings.php [19:35:06] yea:) [19:40:29] (03PS1) 10Greg Grossmeier: Move Catalan and Hebrew Wikipedias to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253371 (https://phabricator.wikimedia.org/T115002) [19:41:17] greg-g: Interesting. [19:41:28] greg-g: Canary wikis? [19:41:42] (03CR) 10Greg Grossmeier: "This is weird because I didn't want to remove hewiki and cawiki from wikipedias.dblist so they're double listed (in group1 and wikipedias)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253371 (https://phabricator.wikimedia.org/T115002) (owner: 10Greg Grossmeier) [19:41:47] (03PS3) 10Andrew Bogott: Set minimum password length of 8 for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) [19:42:20] James_F: yup [19:43:17] greg-g: Makes for a more confusing release cadence. "Test wikis on Tuesday, sister projects on Wednesday, Wikipedias on Thursday" is at least simple. 
But not sure anyone to whom I have to explain that cares. [19:43:34] 6operations: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1808894 (10Deskana) >>! In T101486#1478979, @Aklapper wrote: > @Deskana: Any idea who could decide on this request or in whose court that could be? :-/ Sorry, I have no idea. [19:43:53] (03PS1) 10Yuvipanda: dynamicproxy: Don't set slaveof if it is not a slave [puppet] - 10https://gerrit.wikimedia.org/r/253373 [19:44:37] James_F: yeah, plus "test wikis" is kind of weird too with zerowiki :) [19:44:42] 6operations: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1808903 (10demon) Decom? Is anything using it or is this just an ancient alias that everyone has forgotten? [19:44:48] greg-g: Indeed, but. :_) [19:44:59] yeah, alas... don't look at the man behind the curtain [19:45:09] (03PS2) 10Yuvipanda: dynamicproxy: Don't set slaveof if it is not a slave [puppet] - 10https://gerrit.wikimedia.org/r/253373 [19:46:24] (03CR) 10Yuvipanda: [C: 032] dynamicproxy: Don't set slaveof if it is not a slave [puppet] - 10https://gerrit.wikimedia.org/r/253373 (owner: 10Yuvipanda) [19:47:52] 6operations: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1808918 (10Dzahn) Nothing much in Google for this. except the ticket itself, adding exceptions for HTTPSeverywhere and meta stuff. Probably the latter. So if Discovery is not interested in this we should indeed remove it. The ticke... 
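The "SometimesCommonSettings.php" pattern joked about above looks roughly like this hypothetical sketch (wikitech's database name is labswiki; treat the exact policy key as an assumption, not a quote from the actual config repo):

```php
// Per-wiki override living in the "common" settings file, gated on
// the database name -- the pattern being lamented here.
if ( $wgDBname === 'labswiki' ) { // labswiki == wikitech
	// the kind of knob the password-length patches above adjust
	$wgPasswordPolicy['policies']['default']['MinimalPasswordLength'] = 10;
}
```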
[19:51:09] (03PS1) 10Dzahn: delete sitemap.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253375 (https://phabricator.wikimedia.org/T101486) [19:52:28] !log Running cleanupBlocks.php on arwiki, eswikibooks, eswikinews, and testwiki for [[phab:T118625]] [19:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:53] (03PS2) 10Dzahn: interface: some lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249344 [19:58:42] (03CR) 10Dzahn: [C: 032] interface: some lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/249344 (owner: 10Dzahn) [19:58:43] ori: hmm, interesting fun issues. it tries to start and then fails with [29849] 16 Nov 19:52:47.527 # Creating Server TCP listening socket *:6379: bind: Address already in use [19:58:43] ori: so there's a redis server running except there's also a failed unit [19:58:43] ah [19:58:43] redis-server wasn't stopped properly [19:58:44] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:58:44] (03CR) 10Chad: [C: 031] delete sitemap.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253375 (https://phabricator.wikimedia.org/T101486) (owner: 10Dzahn) [19:58:48] (03CR) 10Alex Monk: "Surely a user has to pass all policies that their account is subject to, and therefore upping 'default' is enough without repeating it for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [19:59:30] (03CR) 10Dzahn: "yes, can we fix the typo separate from a general change though" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253063 (owner: 10Dzahn) [20:01:05] 6operations, 10EventBus, 10MediaWiki-Cache, 6Performance-Team, and 2 others: Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1809023 (10faidon) OK, the above comment caused some confusion — apologies for that. 
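The failure ori hits here is the generic one: a second bind() on a port that an orphaned process still holds. A minimal self-contained reproduction (no redis involved, and the OS picks the port, so nothing on the host is assumed):

```python
import errno
import socket

# An "orphaned" listener standing in for the redis-server that was
# never stopped before the new systemd unit tried to start.
orphan = socket.socket()
orphan.bind(("127.0.0.1", 0))   # OS picks a free port
orphan.listen(1)
port = orphan.getsockname()[1]

# The "new unit" attempting to bind the same address fails with
# EADDRINUSE, matching "Creating Server TCP listening socket *:6379:
# bind: Address already in use" in the log above.
newcomer = socket.socket()
try:
    newcomer.bind(("127.0.0.1", port))
    failed_errno = None
except OSError as exc:
    failed_errno = exc.errno

print(failed_errno == errno.EADDRINUSE)
orphan.close()
newcomer.close()
```

The fix in the log follows directly: kill the leftover redis-server so the unit's bind succeeds; `ss -ltnp` on the port identifies which process is holding it.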
I had a chat with... [20:01:43] (03PS1) 10Yuvipanda: tools: Don't explicitly require redis service for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/253379 [20:01:57] (03PS2) 10Yuvipanda: tools: Don't explicitly require redis service for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/253379 [20:02:31] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't explicitly require redis service for kube2proxy [puppet] - 10https://gerrit.wikimedia.org/r/253379 (owner: 10Yuvipanda) [20:03:34] (03CR) 10Dzahn: "i abstain. don't have anything to add. removing self. please discuss with the author of the patch that disabled them" [puppet] - 10https://gerrit.wikimedia.org/r/250449 (owner: 10Paladox) [20:09:05] (03CR) 10Andrew Bogott: "Alex, Reedy just tested and says you are right!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [20:09:38] (03PS4) 10Andrew Bogott: Set minimum password length of 8 for wikitech. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) [20:19:10] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 1 failures [20:19:45] (03PS4) 10Reedy: Log privileged users with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [20:20:00] (03CR) 10Reedy: [C: 031] Log privileged users with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [20:20:00] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 8.22282191919 [20:22:59] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 1.3934430303 [20:23:08] (03PS2) 10Reedy: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [20:24:03] (03PS1) 10Yuvipanda: tools: Fix bug in unregistering routes in dynamicproxy [puppet] - 10https://gerrit.wikimedia.org/r/253386 [20:24:14] (03CR) 10Reedy: "Is it worth logging names, even if we're mostly interested in the count of users?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [20:24:35] (03PS2) 10Yuvipanda: tools: Fix bug in unregistering routes in dynamicproxy [puppet] - 10https://gerrit.wikimedia.org/r/253386 [20:25:30] (03PS4) 10Dzahn: lvs: double quoted string and other lint [puppet] - 10https://gerrit.wikimedia.org/r/243856 [20:29:52] (03CR) 10Yuvipanda: [C: 032] tools: Fix bug in unregistering routes in dynamicproxy [puppet] - 10https://gerrit.wikimedia.org/r/253386 (owner: 10Yuvipanda) [20:30:50] (03PS1) 10Rush: wmflib: nuyaml_backend.rb handle empty files [puppet] - 10https://gerrit.wikimedia.org/r/253387 [20:33:10] (03CR) 10CSteipp: "More data would be nice, but I'd rather just get this out than do anything that might be borderline for privacy. I'm hoping we turn this o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [20:38:45] (03PS5) 10Dzahn: lvs: double quoted string and other lint [puppet] - 10https://gerrit.wikimedia.org/r/243856 [20:38:57] (03PS5) 10Andrew Bogott: Set minimum password length of 10 for wikitech. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) [20:40:45] (03CR) 10Dzahn: [C: 032] "checked with compiler http://puppet-compiler.wmflabs.org/1297/" [puppet] - 10https://gerrit.wikimedia.org/r/243856 (owner: 10Dzahn) [20:43:32] (03CR) 10Dduvall: [C: 031] wmflib: nuyaml_backend.rb handle empty files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253387 (owner: 10Rush) [20:45:49] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:48:47] (03PS2) 10Rush: wmflib: nuyaml_backend.rb handle empty files [puppet] - 10https://gerrit.wikimedia.org/r/253387 [20:50:31] akosiaris: https://gerrit.wikimedia.org/r/#/c/253052/ would be useful for something unrelated [20:50:37] (03CR) 10Luke081515: [C: 031] Set minimum password length of 10 for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [20:50:41] not sure about those /test/ files [20:51:52] (03CR) 10Luke081515: [C: 031] Log privileged users with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [20:52:20] ori: dynamicproxy has been fully transitioned [20:52:25] \o/ [20:52:30] thanks! :) [20:53:28] PlasmaFury: transitioned from what to what? [20:53:46] andrewbogott: oh from the redis::legacy class to redis::instance [20:53:57] it's a noop everywhere except in puppet [20:54:25] (03CR) 10Rush: [C: 032] wmflib: nuyaml_backend.rb handle empty files [puppet] - 10https://gerrit.wikimedia.org/r/253387 (owner: 10Rush) [20:54:59] ping cajoel [20:55:07] ? [20:55:48] * Platonides wasn't expecting such a quick reply for someone who had been idle for 7 hours :) [20:55:56] (03CR) 10Dzahn: [C: 031] Set minimum password length of 10 for wikitech. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [20:55:59] pm [20:56:09] (03CR) 10Reedy: [C: 031] Set minimum password length of 10 for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [20:56:15] andrewbogott: Want it adding to swat? [20:56:27] Reedy: yep, adding [20:56:49] I'd be tempted to just deploy that and Chris' patch [20:56:49] haha [20:57:20] <_joe_> andrewbogott: will that affect new passwords or old users will be affected too? [20:57:35] DOIT :) [20:57:39] <_joe_> also, is this doen in mediawiki and not in ldap? [20:57:43] _joe_: as I understand it, old users will be nagged on login but not required to actually reset [20:57:58] they will be redirected to the password reset page [20:58:02] * Reedy looks what's on the calendar [20:58:03] And, yeah, the check is at login time, which happens via mediawiki [20:58:05] <_joe_> ldap has a very good password checking system [20:58:25] Hmm. There's nothing going on atm... [20:58:30] yeah, but having the ldap password check talk back to the user is a good deal more trouble. [20:58:46] Especially if you're running edir (/me wrote a password checking app for that... another life). [20:59:00] csteipp: Imma jfdi [20:59:16] Well, actually, I’m just assuming that the password code we’re talking about wraps the ldap extension. Let me make sure... 
[20:59:28] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1809106 (10kevinator) approved: you can add @nuria, @mforns and @madhuvishy to aqs-user [20:59:45] 10Ops-Access-Requests, 6operations: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1809108 (10kevinator) a:5kevinator>3None [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151116T2100). Please do the needful. [21:00:20] (03PS1) 10Ori.livneh: xenon: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253395 [21:00:26] no mobileapps deploy today [21:00:33] parsoid deploy time ... [21:00:47] andrewbogott: It should execute this stuff before it gets to the LDAP code [21:01:01] yeah, that’s what i’d expect [21:01:58] (03CR) 10Reedy: [C: 032] Log privileged users with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [21:02:22] (03Merged) 10jenkins-bot: Log privileged users with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [21:02:35] (03PS6) 10Reedy: Set minimum password length of 10 for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [21:02:49] (03CR) 10Reedy: [C: 032] Set minimum password length of 10 for wikitech. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [21:03:10] (03Merged) 10jenkins-bot: Set minimum password length of 10 for wikitech. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253366 (https://phabricator.wikimedia.org/T118751) (owner: 10Andrew Bogott) [21:03:55] !log starting parsoid deploy [21:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:14] Reedy: I can’t remember how config changes work… does that apply to hosts automatically, or do we still need a swat entry? [21:04:28] andrewbogott: I'm just gonna deploy it now, out of sync [21:04:35] ah, ok [21:04:40] in that case I’ll remove it from swat [21:05:57] Can you remove the one I added too? :) [21:06:29] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Log privileged users with short passwords. Set minimum password length of 10 for wikitech. (duration: 00m 26s) [21:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:44] (03PS3) 10Reedy: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [21:06:55] ^ that one can wait [21:07:00] !log synced fresh code; restarting parsoid on wtp1005 as a canary [21:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:54] csteipp: We've already got one! [21:08:55] 2015-11-16 21:08:08 mw1148 dewiki badpass INFO: Login by privileged user with too short password [21:10:37] Another on dewiki [21:10:57] looking good. restarting parsoid on all nodes. [21:16:20] PROBLEM - Parsoid on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:16:59] can someone kill process 9900 on wtp1012 [21:17:15] it is holding up restart of parsoid and just keeping itin a restart loop via upstart. [21:17:52] akosiaris, _joe_ ^ [21:19:00] or andrewbogott if you can help. [21:19:19] * subbu cannot sudo kill stuck processes [21:19:37] i will kill it for you subbu [21:19:39] thanks. 
[21:19:47] * andrewbogott is a moment too late [21:20:51] subbu done [21:21:06] thanks .. will restart parsoid there. [21:21:28] !log restbase canary deploy to restbase1001 of 54bc0517 [21:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:21:50] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.034 second response time [21:21:55] (03PS1) 10Reedy: Log username of user with short password [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253455 [21:23:08] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1809136 (10Addshore) I don't think any apache logs end up on fluorine [21:23:31] addshore: Yes they do [21:23:37] where? [21:23:38] /a/mw-log/apache2.log [21:23:43] !log finished deploying parsoid 3a6f3b9e [21:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:04] Reedy: Is there a progress at T118747? [21:24:20] Reedy: what logs exactly are those though? error logs only [21:24:23] ? 
[21:24:31] Nov 16 21:23:23 mw1184: [proxy_fcgi:error] [pid 5582:tid 140186256078592] [client 10.64.0.105:12116] AH01070: Error parsing script headers, referer: https://en.wikipedia.org/w/index.php?title=User:Praemonitus/sandbox&action=submit [21:24:34] etc [21:24:39] addshore: You said "any" :D [21:24:42] (03CR) 10CSteipp: "Since we have accounts logging in with passwords that are too short, I think this is appropriate so that we can determine the impact updat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253455 (owner: 10Reedy) [21:24:43] :P [21:25:13] (03CR) 10Reedy: [C: 032] Log username of user with short password [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253455 (owner: 10Reedy) [21:25:48] (03PS1) 10Yuvipanda: labs: Add a simple LAMP role [puppet] - 10https://gerrit.wikimedia.org/r/253456 [21:25:55] (03Merged) 10jenkins-bot: Log username of user with short password [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253455 (owner: 10Reedy) [21:26:33] how the hell do you strikethrough text in phabricator... I'm sure its possible.... [21:26:39] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Log username of users with short passwords (duration: 00m 26s) [21:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:25] addshore: ~~oh noes, I'm striked through~~ [21:27:30] ahhh [21:34:13] (03PS1) 10Dzahn: varnish: move file to module [puppet] - 10https://gerrit.wikimedia.org/r/253457 [21:35:29] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1809155 (10ssastry) @RobH I didn't set up ruthenium and I am not familiar with what is involv... 
[21:36:30] (03CR) 10Yuvipanda: [C: 032] labs: Add a simple LAMP role [puppet] - 10https://gerrit.wikimedia.org/r/253456 (owner: 10Yuvipanda) [21:38:48] !log restbase start deployment of 54bc0517 [21:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:39:14] (03Abandoned) 10BBlack: Exclude WMF Office from ratelimiter [puppet] - 10https://gerrit.wikimedia.org/r/252439 (owner: 10BBlack) [21:42:00] (03PS1) 10Yuvipanda: mysql: Add jessie support by fixing conditional [puppet] - 10https://gerrit.wikimedia.org/r/253458 [21:42:20] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1809165 (10ssastry) [21:43:26] (03CR) 10Yuvipanda: [C: 032] mysql: Add jessie support by fixing conditional [puppet] - 10https://gerrit.wikimedia.org/r/253458 (owner: 10Yuvipanda) [21:47:18] (03PS2) 10Dzahn: puppet-tests: remove broken inclusion of nagios.pp [puppet] - 10https://gerrit.wikimedia.org/r/253052 [21:47:57] !log restbase end deployment of 54bc0517 [21:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:48:09] addshore: I think you want to be in analytics-privatedata-users [21:48:21] (03CR) 10Dzahn: [C: 032] "the file that is included here does not exist at this place anymore" [puppet] - 10https://gerrit.wikimedia.org/r/253052 (owner: 10Dzahn) [21:48:32] /stat1002 for sampled webrequest [21:49:53] Reedy: Is there a progress at T118747? [21:50:05] no [21:50:07] If no one has commented, I suspect not [21:50:21] 6operations, 10RESTBase, 6Services: Switch RESTBase to use Node.js 4 - https://phabricator.wikimedia.org/T107762#1809178 (10cscott) +1. Parsoid would like to use a version of node which includes generator support, which would be node >= 0.11. The version of node running in production (0.10.25) has had a nu... 
[21:50:25] I might have a quick look in a bit, but I don't intend to do much more today [21:50:29] (03PS1) 10Reedy: Don't log sysop with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253462 [21:50:36] csteipp: ^ [21:50:58] (03CR) 10CSteipp: [C: 031] Don't log sysop with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253462 (owner: 10Reedy) [21:51:11] (03CR) 10Reedy: [C: 032] Don't log sysop with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253462 (owner: 10Reedy) [21:51:28] (03CR) 10Krinkle: [C: 031] fix wrong IP for codfw redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253063 (owner: 10Dzahn) [21:51:31] (03Merged) 10jenkins-bot: Don't log sysop with short passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253462 (owner: 10Reedy) [21:51:47] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1298/" [puppet] - 10https://gerrit.wikimedia.org/r/249655 (owner: 10Dzahn) [21:52:20] (03PS1) 10Yuvipanda: mysql: Disable apparmor in labs instances [puppet] - 10https://gerrit.wikimedia.org/r/253463 [21:52:26] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Don't log sysops with short passwords (duration: 00m 27s) [21:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:52:43] (03PS2) 10Yuvipanda: mysql: Disable apparmor in labs instances [puppet] - 10https://gerrit.wikimedia.org/r/253463 [21:53:34] (03CR) 10Krinkle: "In logstash there is a constant stream of "Could not connect to server 10.192.0.199" in the redis channel for mediawiki errors. Mostly fro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253063 (owner: 10Dzahn) [21:53:52] mutante: OK to add to swat in a few hours? 
^ [21:54:04] jouncebot: next [21:54:04] In 2 hour(s) and 5 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151117T0000) [21:54:08] Krinkle: yes [21:54:24] i wish adding was a bot trigger too :) [21:54:56] (03CR) 10Yuvipanda: [C: 032] mysql: Disable apparmor in labs instances [puppet] - 10https://gerrit.wikimedia.org/r/253463 (owner: 10Yuvipanda) [21:55:04] (03Abandoned) 10BBlack: Remove webrequest "ip" and "x_forwarded_for" [puppet] - 10https://gerrit.wikimedia.org/r/252929 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [21:56:07] I need to kill the 'webserver' module [21:56:40] goddamit [21:56:43] https://tools.wmflabs.org/watroles/role/role::lamp::labs [21:56:45] * PlasmaFury weeps [21:57:30] PROBLEM - puppet last run on magnesium is CRITICAL: CRITICAL: Puppet has 1 failures [22:00:07] Krinkle: you already did it. ok, cool [22:00:09] (03PS1) 10Yuvipanda: mysql: Followup to I14cf84a4e7452 [puppet] - 10https://gerrit.wikimedia.org/r/253464 [22:00:20] PlasmaFury: i think your change broke puppet on magnesium [22:00:34] ah, probably that follow-up [22:00:38] mutante: yup the followup should fix it [22:00:44] (03PS2) 10Yuvipanda: mysql: Followup to I14cf84a4e7452 [puppet] - 10https://gerrit.wikimedia.org/r/253464 [22:00:46] *nod* 'l [22:00:53] (03CR) 10Yuvipanda: [C: 032 V: 032] mysql: Followup to I14cf84a4e7452 [puppet] - 10https://gerrit.wikimedia.org/r/253464 (owner: 10Yuvipanda) [22:01:13] 6operations, 10Analytics, 6Analytics-Kanban, 6Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1809193 (10Ottomata) [22:01:32] mutante: hmm it's broken in different ways [22:02:04] hmmmmmmmmmm [22:02:08] it wanted mysql-client-5.1 but there is 5.5 by now [22:02:38] we can also fix this differently [22:02:46] move the services away from magnesium [22:02:51] we want to anyways :) [22:02:59] and have an open thing for it [22:03:13] mutante: hmm, I'm wondering if this means that 
that role is broken on precise everywhere [22:03:25] well, all uses of mysql + precise [22:03:38] is this the only one in production? [22:03:47] because i saw no other errors pop up [22:03:51] so it seems it is [22:04:02] shouldn't have mysql on localhost in the first place ..eh [22:04:41] the databases should be on a real db host and the roles should be on a ganeti VM [22:05:12] PlasmaFury: if it's hard to fix, i can probably up the priority for just killing magnesium [22:05:37] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [22:05:45] mutante: I'll know in about 5 minutes! [22:05:45] (03PS1) 10Yuvipanda: mysql: Followup to I14cf84a4 [puppet] - 10https://gerrit.wikimedia.org/r/253466 [22:06:10] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [22:06:12] (03CR) 10jenkins-bot: [V: 04-1] mysql: Followup to I14cf84a4 [puppet] - 10https://gerrit.wikimedia.org/r/253466 (owner: 10Yuvipanda) [22:06:13] ok, let me know. i'll be afk for a little bit and then back [22:06:14] mutante: can you salt or something and find out if we have mysql-server-5.1 anywhere? [22:06:16] mutante: ah ok [22:06:27] (03PS2) 10Yuvipanda: mysql: Followup to I14cf84a4 [puppet] - 10https://gerrit.wikimedia.org/r/253466 [22:07:23] (03CR) 10Yuvipanda: [C: 032 V: 032] mysql: Followup to I14cf84a4 [puppet] - 10https://gerrit.wikimedia.org/r/253466 (owner: 10Yuvipanda) [22:09:28] (03PS1) 10Andrew Bogott: Add new group aqs-users for shell and cqlsh access only.
[puppet] - 10https://gerrit.wikimedia.org/r/253467 (https://phabricator.wikimedia.org/T117473) [22:11:19] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail [22:12:47] (03PS1) 10GWicke: Remove HTML dump reference [puppet] - 10https://gerrit.wikimedia.org/r/253468 [22:13:18] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1809234 (10RobH) I forgot to add from the IRC discussion comment: @Nuria may need to use the cassandra shell on these systems. (Use is documented on https://... [22:13:40] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [22:15:00] (03CR) 10Milimetric: [C: 031] Remove HTML dump reference [puppet] - 10https://gerrit.wikimedia.org/r/253468 (owner: 10GWicke) [22:15:09] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:15:43] (03CR) 10Mforns: [C: 031] "thx" [puppet] - 10https://gerrit.wikimedia.org/r/253467 (https://phabricator.wikimedia.org/T117473) (owner: 10Andrew Bogott) [22:17:29] (03PS1) 10Yuvipanda: mysql: Drop lucid support! [puppet] - 10https://gerrit.wikimedia.org/r/253471 [22:17:41] (03PS2) 10Yuvipanda: mysql: Drop lucid support! [puppet] - 10https://gerrit.wikimedia.org/r/253471 [22:18:45] (03PS1) 10BBlack: webrequest: Add X-Client-IP -> client_ip [puppet] - 10https://gerrit.wikimedia.org/r/253472 (https://phabricator.wikimedia.org/T118557) [22:18:47] (03PS1) 10BBlack: webrequest: remove "ip" field [puppet] - 10https://gerrit.wikimedia.org/r/253473 (https://phabricator.wikimedia.org/T118557) [22:18:49] (03PS1) 10BBlack: webrequest: remove X-Forwarded-For [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) [22:22:43] Krenair: ostriches: RoanKattouw: Who's doing the swat in a couple of hours? 
[22:22:55] bit busy [22:23:05] I can do it [22:23:12] * ostriches looks at calendar [22:23:39] I want to add a debugging patch that hoo committed (I've cherry picked to wmf.6) but I'm not gonna be about at deploy time probably [22:23:58] I would like to push it out soonish [22:24:08] it just makes an exception message more specific [22:24:39] hoo: well, swat is just over 90 minutes [22:25:08] (03PS1) 10BBlack: allow config lines up to 4K length [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/253476 [22:25:16] If it's just an exception message tweak, I'm happy to review it and not require others to be there [22:25:31] Reedy: But I need to get up early tomorrow [22:25:57] hoo: Me too. Hence asking the swatters if they mind putting it out anyway :P [22:26:06] (03CR) 10Ottomata: [C: 032] Remove HTML dump reference [puppet] - 10https://gerrit.wikimedia.org/r/253468 (owner: 10GWicke) [22:26:08] RoanKattouw: I reviewed it in master... [22:26:21] Cool [22:26:25] https://gerrit.wikimedia.org/r/253475 is the cherry pick [22:27:08] LGTM [22:28:37] shall I deploy or does anyone else wants to? [22:29:18] s/wants/want/ [22:29:28] hoo: RoanKattouw has said he'll happily do it in SWAT with neither of us about [22:29:34] meh [22:29:52] I would like to do it now to see what's actually wrong on dewiki [22:29:54] Presumably, if we do it earlier, you'll end up staying later to debug the issue further? ;) [22:30:03] without having to do shell magic looping over revision or crap like that [22:30:30] I don't think there's anything else going on atm, so would be ok to do it now [22:31:44] greg-g: Any objections? [22:32:23] (03PS3) 10Yuvipanda: mysql: Drop lucid support! [puppet] - 10https://gerrit.wikimedia.org/r/253471 [22:32:30] (03CR) 10Yuvipanda: [C: 032 V: 032] mysql: Drop lucid support! 
[puppet] - 10https://gerrit.wikimedia.org/r/253471 (owner: 10Yuvipanda) [22:34:20] mutante: ^ should fix magnesium [22:34:25] the original conditional was for lucid support [22:34:27] and wasn't dropped [22:34:32] I misread it as precise support [22:34:47] hoo: Go for it [22:35:07] ok [22:35:49] RECOVERY - puppet last run on magnesium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:36:27] ok on neon too [22:36:38] so those are the only two precise hosts with mysql installed :) [22:36:50] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:37:03] yay [22:38:30] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:45:36] !log rolling restbase restart to apply https://gerrit.wikimedia.org/r/#/c/253468/ [22:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:47:11] 7Puppet, 6operations: Remove the webserver module - https://phabricator.wikimedia.org/T118786#1809286 (10yuvipanda) 3NEW [22:47:21] 7Puppet, 6operations: Remove the webserver module - https://phabricator.wikimedia.org/T118786#1809293 (10yuvipanda) [22:47:36] !log hoo@tin Synchronized php-1.27.0-wmf.6/includes/Revision.php: Don't claim model validation failed if the content couldn't be loaded (duration: 00m 26s) [22:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:48:04] (03PS1) 10Yuvipanda: releases: Move from webserver module to apache module [puppet] - 10https://gerrit.wikimedia.org/r/253478 (https://phabricator.wikimedia.org/T118786) [22:48:06] doctaxon: ^ [22:48:09] ori: I'm trying to do code cleanup on the puppet repo for a bit :) ^ if you have any time [22:48:15] Please try it again, now [22:48:17] yes I saw it [22:48:20] okay [22:48:24] Btw, syncing to tin failed (yes, for real) [22:49:39] <_joe_> PlasmaFury: I thought we got rid of the webserver module a 
long time ago :P [22:49:45] [29c86753] 2015-11-16 22:49:00: Fatal exception of type MWException [22:49:51] hoo ^ [22:49:59] (03CR) 10Ori.livneh: [C: 031] "+1, but: I think the docroot was one of those things that was parametrized for the sake of having parameters rather than because someone w" [puppet] - 10https://gerrit.wikimedia.org/r/253478 (https://phabricator.wikimedia.org/T118786) (owner: 10Yuvipanda) [22:50:05] (03PS1) 10Ori.livneh: Add eventlog2001 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/253479 [22:50:16] hoo: failed completely? Or just the mira part threw up some permissiion errors? [22:50:18] doctaxon: Got it [22:50:26] (03PS1) 10Yuvipanda: releases: Remove double include of base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/253480 [22:50:26] Reedy: Well, I synced from tin [22:50:31] and the tin part threw up [22:50:52] might want to pastebin it for twentyafterfour [22:52:19] yeah [22:52:44] http://pastebin.com/FggXg18K [22:53:09] that's the expected result right now [22:53:09] (03PS2) 10Yuvipanda: releases: Move from webserver module to apache module [puppet] - 10https://gerrit.wikimedia.org/r/253478 (https://phabricator.wikimedia.org/T118786) [22:53:11] (03PS2) 10Yuvipanda: releases: Remove double include of base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/253480 [22:53:28] hoo: notice how the actual sync succeeded [22:53:39] only the mirroring to mira has issues [22:54:07] https://gerrit.wikimedia.org/r/#/c/253040/ [22:54:32] good to know [22:54:44] Reedy: Seems like we miss the "text" for a specific revision [22:54:52] not surprising [22:54:58] hmm [22:55:09] (03PS3) 10Yuvipanda: releases: Move from webserver module to apache module [puppet] - 10https://gerrit.wikimedia.org/r/253478 (https://phabricator.wikimedia.org/T118786) [22:55:11] Priority Unbreak Now! 
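[Editor's note] For readers following the magnesium fix above: the bug was a distribution conditional in the mysql module that had been written for lucid support but was misread as precise support, leaving precise hosts asking for the long-gone mysql-client-5.1 package. A minimal sketch of the corrected pattern follows; the class, fact comparison, and package names here are illustrative, not the actual module code:

```puppet
# Hedged sketch only -- the real module's names and structure may differ.
# Gate the package choice on the distribution release instead of keeping
# a dead lucid-era branch around.
class mysql::client {
  if versioncmp($::lsbdistrelease, '12.04') >= 0 {
    # precise (12.04) and newer ship the 5.5 client
    $client_package = 'mysql-client-5.5'
  } else {
    # lucid support was dropped; fail loudly rather than install 5.1
    fail("mysql::client: unsupported release ${::lsbdistrelease}")
  }

  package { $client_package:
    ensure => present,
  }
}
```

The chat above confirms the effect: once merged, puppet recovered on magnesium and neon, the only two precise hosts that still had mysql installed.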
[22:55:11] (03PS3) 10Yuvipanda: releases: Remove double include of base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/253480 [22:55:14] seems a bit excessive :) [22:56:05] (03CR) 10Yuvipanda: [C: 032 V: 032] releases: Remove double include of base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/253480 (owner: 10Yuvipanda) [22:56:34] Reedy: Do we have a standard procedure about this? Could try all the ES servers [22:56:53] (03CR) 10Yuvipanda: [C: 032] releases: Move from webserver module to apache module [puppet] - 10https://gerrit.wikimedia.org/r/253478 (https://phabricator.wikimedia.org/T118786) (owner: 10Yuvipanda) [22:57:39] Not really.. [23:00:18] 6operations: releases.wikimedia.org should be https only and have hsts set - https://phabricator.wikimedia.org/T118787#1809332 (10yuvipanda) 3NEW [23:02:20] ok [23:02:24] looks like my something died [23:02:34] that's better [23:02:40] _joe_: librenms and torrus also use it [23:03:43] am I here? [23:03:45] * yuvipanda isn't sure [23:03:48] !ping [23:04:54] yuvipanda: gnip! [23:05:00] <_joe_> no you're not [23:06:32] ok! [23:06:55] (03PS1) 10Yuvipanda: statistics: Remove redundant webserver::apache include [puppet] - 10https://gerrit.wikimedia.org/r/253481 (https://phabricator.wikimedia.org/T118786) [23:06:55] ok, down to two uses now [23:08:23] (03CR) 10Yuvipanda: [C: 032] statistics: Remove redundant webserver::apache include [puppet] - 10https://gerrit.wikimedia.org/r/253481 (https://phabricator.wikimedia.org/T118786) (owner: 10Yuvipanda) [23:09:25] Reedy: Looks like that one might have missed some conversion round [23:09:37] See SELECT * FROM text WHERE old_id = 5422832; [23:09:59] mutante: paravoid is torrus still in use?
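[Editor's note] Context for the SELECT above: in MediaWiki's text table, old_text holds either the revision text itself or, when old_flags contains 'external', a pointer of the form DB://cluster/id into the external storage (ES) clusters; a missing or broken row at either end produces exactly the "content couldn't be loaded" failures being debugged here. A small illustrative parser for such pointers, not MediaWiki's actual code (the cluster and id values are made up):

```python
def parse_es_url(url):
    """Split a MediaWiki external-storage pointer of the form
    DB://cluster/id or DB://cluster/id/itemid into its parts.
    Illustrative sketch only, not MediaWiki's implementation."""
    prefix = "DB://"
    if not url.startswith(prefix):
        raise ValueError("not an external-storage URL: %r" % url)
    parts = url[len(prefix):].split("/")
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError("missing blob id: %r" % url)
    cluster = parts[0]
    blob_id = int(parts[1])
    # concatenated-blob storage appends an item id after the blob id
    item_id = int(parts[2]) if len(parts) > 2 else None
    return cluster, blob_id, item_id

print(parse_es_url("DB://cluster24/12345"))
```

"Could try all the ES servers" above then amounts to resolving that pointer against each cluster to see whether any of them still holds the blob.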
[23:10:31] hmm [23:10:35] ah I see how that works [23:14:20] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [23:15:22] (03PS1) 10Yuvipanda: statistics: Followup to I5c5ef877003 [puppet] - 10https://gerrit.wikimedia.org/r/253482 [23:15:32] * PlasmaFury has finally fixed up his alternatives to make nvim default [23:16:08] (03CR) 10Yuvipanda: [C: 032 V: 032] statistics: Followup to I5c5ef877003 [puppet] - 10https://gerrit.wikimedia.org/r/253482 (owner: 10Yuvipanda) [23:17:08] (03PS2) 10Ori.livneh: xenon: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253395 [23:17:14] (03CR) 10Ori.livneh: [C: 032 V: 032] xenon: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253395 (owner: 10Ori.livneh) [23:18:09] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:18:43] (03PS2) 10Ori.livneh: Add eventlog2001 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/253479 [23:21:08] _joe_: did you end up creating subtasks for etcd auth? 
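[Editor's note] The cleanup threaded through this log replaces the legacy webserver module with the apache module, whose apache::site define manages a single vhost file. A hedged sketch of the migration pattern being applied; the role name, site name, and template path are invented for illustration:

```puppet
# Illustrative only: declare one Apache vhost through apache::site
# instead of the old webserver module's equivalents. Names below are
# placeholders, not anything from operations/puppet.
class role::example_site {
  include ::apache

  apache::site { 'example.wikimedia.org':
    # the apache module renders this into a sites-enabled vhost file;
    # the template path here is made up
    content => template('role/example_site/example.wikimedia.org.erb'),
  }
}
```

This is the shape of the smokeping, torrus, and librenms changes that follow, which is also why copying the previously generated apache config verbatim (as done for smokeping below) is a low-risk first step.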
[23:21:14] (03PS1) 10Yuvipanda: smokeping: Use apache::site [puppet] - 10https://gerrit.wikimedia.org/r/253484 (https://phabricator.wikimedia.org/T118786) [23:21:56] (03CR) 10Ori.livneh: [C: 032] Add eventlog2001 to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/253479 (owner: 10Ori.livneh) [23:23:20] PlasmaFury: bless your heart [23:24:01] (03PS1) 10Ori.livneh: migrate rcstream from redis::legacy to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253487 [23:24:02] (03PS1) 10Yuvipanda: torrus: Move to apache::site [puppet] - 10https://gerrit.wikimedia.org/r/253488 (https://phabricator.wikimedia.org/T118786) [23:24:27] # redirect the old, pre-Jan 2014 name to librenms [23:24:29] that can die [23:25:05] (03PS2) 10Ori.livneh: migrate rcstream from redis::legacy to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253487 [23:25:13] (03CR) 10Ori.livneh: [C: 032 V: 032] migrate rcstream from redis::legacy to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253487 (owner: 10Ori.livneh) [23:25:31] 6operations, 10Wikimedia-DNS: Decom observium.wikimedia.org - https://phabricator.wikimedia.org/T118790#1809413 (10yuvipanda) 3NEW [23:25:53] PlasmaFury: you are on a roll! [23:26:27] (03PS1) 10Yuvipanda: librenms: Kill old redirect [puppet] - 10https://gerrit.wikimedia.org/r/253489 (https://phabricator.wikimedia.org/T118790) [23:26:31] ori_: heh [23:26:48] ori_: design research wanted a labs instance to run wordpress on so I looked at our old lamp role and was horrified and 3h later here I am [23:26:58] heh [23:27:01] there's only one more webserver left to kill [23:27:03] apache::site is not bad, right? 
[23:27:15] ori: yeah, much better than the webserver::apache::site [23:27:36] (03PS2) 10Yuvipanda: smokeping: Use apache::site [puppet] - 10https://gerrit.wikimedia.org/r/253484 (https://phabricator.wikimedia.org/T118786) [23:28:24] (03CR) 10Yuvipanda: [C: 032 V: 032] smokeping: Use apache::site [puppet] - 10https://gerrit.wikimedia.org/r/253484 (https://phabricator.wikimedia.org/T118786) (owner: 10Yuvipanda) [23:29:40] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: puppet fail [23:34:50] hmm [23:34:53] I'm not sure if I broke smokeping [23:34:58] or if it has been broken for a while [23:35:09] since I just copied the generated apache config [23:36:29] (03PS2) 10Yuvipanda: torrus: Move to apache::site [puppet] - 10https://gerrit.wikimedia.org/r/253488 (https://phabricator.wikimedia.org/T118786) [23:36:37] (03CR) 10Yuvipanda: [C: 032 V: 032] torrus: Move to apache::site [puppet] - 10https://gerrit.wikimedia.org/r/253488 (https://phabricator.wikimedia.org/T118786) (owner: 10Yuvipanda) [23:39:39] PlasmaFury: re torrus https://phabricator.wikimedia.org/T87840#1538794 [23:39:40] torrus still works fine [23:39:55] mutante: ah ok [23:40:04] mutante: do you know if the smokeping website used to work at all? [23:40:18] (03PS2) 10Yuvipanda: librenms: Kill old redirect [puppet] - 10https://gerrit.wikimedia.org/r/253489 (https://phabricator.wikimedia.org/T118790) [23:40:38] PlasmaFury: at some point ..yea [23:40:45] ok [23:40:52] (03CR) 10Yuvipanda: [C: 032 V: 032] librenms: Kill old redirect [puppet] - 10https://gerrit.wikimedia.org/r/253489 (https://phabricator.wikimedia.org/T118790) (owner: 10Yuvipanda) [23:41:15] I need to make the DNS change for https://phabricator.wikimedia.org/T118790 [23:41:17] PlasmaFury: https://phabricator.wikimedia.org/T80762 [23:42:19] PlasmaFury: nice @ removing observium [23:42:28] never knew about that one [23:42:42] mutante: :D [23:42:48] mutante: can you tell me how to remove the DNS entry?
[23:43:13] 6operations, 5Continuous-Integration-Scaling: Upload new Zuul packages on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T118340#1809485 (10Andrew) I'd be a bit happier if the source tree that these packages are built from is checked into gerrit, with clearly labeled branch... [23:43:17] PlasmaFury: ah, i can upload the change if you like [23:44:00] mutante: +1 please do :) [23:44:23] (03PS1) 10Dzahn: remove observium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253491 [23:44:54] mutante: ah nice and simple [23:44:55] (03PS2) 10Dzahn: remove observium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253491 (https://phabricator.wikimedia.org/T118790) [23:45:54] mutante: hmm, is librenms etc behind misc varnish? [23:46:01] (03PS3) 10Dzahn: remove observium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253491 (https://phabricator.wikimedia.org/T118790) [23:46:04] they do have a public IP [23:46:23] PlasmaFury: no, it's not. it's on netmon1001 directly [23:46:31] mutante: hmmmmmm [23:46:34] mutante: right [23:46:37] mutante: so only librenms has ssl [23:47:03] I guess I'll leave it be that way and just move it to apache::site [23:47:54] so netmon1001 is both [23:48:06] a backend for misc-web and a public webserver [23:48:28] only librenms has 443 on netmon1001, yea [23:49:29] servermon for example is also on it but behind misc-web [23:51:55] mutante: I see [23:52:04] mutante: is there a particular reason? will they all be behind misc-web at some point?
[23:52:22] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [23:56:45] (03PS1) 10Yuvipanda: librenms: Move to apache::site [puppet] - 10https://gerrit.wikimedia.org/r/253495 (https://phabricator.wikimedia.org/T118786) [23:57:07] (03PS2) 10Yuvipanda: librenms: Move to apache::site [puppet] - 10https://gerrit.wikimedia.org/r/253495 (https://phabricator.wikimedia.org/T118786) [23:57:52] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [23:58:04] greg-g, is it ok to deploy graphoid service update now? no one else is scheduled [23:58:38] PlasmaFury: the default should be being behind misc-web, but there are a few exceptions, and we once said that monitoring tools should probably not be [23:59:14] PlasmaFury: so i _think_ the exception here is because it's a network monitoring tool. BUT . i would still like to ask the same question to the entire team [23:59:52] mutante: yeah, makes sense (both the possibility + ask team)