[00:00:01] !log reedy@tin Synchronized static/images/project-logos/crwiki.png: (no justification provided) (duration: 01m 14s)
[00:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:00:10] NO JUSTIFICATION
[00:00:12] thanks
[00:01:27] Reedy: I can see the new logo now, sweet; thanks & apologies for the late patch
[00:01:32] !log reedy@tin Synchronized wmf-config/InitialiseSettings.php: crwiki logo (duration: 01m 15s)
[00:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:35] Reedy: o_O
[00:02:38] * jdlrobson runs to logstash
[00:02:55] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[00:05:07] thanks Reedy !!
[00:05:55] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:06:55] ^ for that i have another patch pending
[00:07:07] which would allow to skip systemd monitoring on any host
[00:32:49] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[00:35:49] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[00:42:44] (PS3) Dzahn: restbase: allow to skip monitoring, disable on dev hosts [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050)
[00:46:21] (PS1) BryanDavis: openstack: dns-floating-ip-updater.py [puppet] - https://gerrit.wikimedia.org/r/419336
[00:46:53] (CR) jerkins-bot: [V: -1] openstack: dns-floating-ip-updater.py [puppet] - https://gerrit.wikimedia.org/r/419336 (owner: BryanDavis)
[00:48:34] (PS2) BryanDavis: openstack: dns-floating-ip-updater.py [puppet] - https://gerrit.wikimedia.org/r/419336
[00:53:26] (PS4) Dzahn: restbase: allow to skip monitoring, disable on dev hosts [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050)
[01:05:07] (CR) BryanDavis: openstack: dns-floating-ip-updater.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/419336 (owner: BryanDavis)
[01:08:32] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/10437/" [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[01:10:50] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[01:12:29] PROBLEM - HHVM rendering on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:19] RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 75178 bytes in 0.304 second response time
[01:13:50] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[01:17:21] (PS1) Dzahn: restbase: skip root URL monitoring on dev cluster, pt2 [puppet] - https://gerrit.wikimedia.org/r/419337 (https://phabricator.wikimedia.org/T189050)
[01:18:40] (PS2) Dzahn: restbase: skip root URL monitoring on dev cluster, pt2 [puppet] - https://gerrit.wikimedia.org/r/419337 (https://phabricator.wikimedia.org/T189050)
[01:19:12] (CR) Dzahn: [C: 2] "follow-up https://gerrit.wikimedia.org/r/#/c/419337/" [puppet] - https://gerrit.wikimedia.org/r/419255 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[01:19:35] (CR) Dzahn: [C: 2] restbase: skip root URL monitoring on dev cluster, pt2 [puppet] - https://gerrit.wikimedia.org/r/419337 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[01:21:59] RECOVERY - cassandra-a service on restbase-dev1006 is OK: OK - cassandra-a is active
[01:29:32] Operations, monitoring, Patch-For-Review: restbase: skip icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4048527 (Dzahn)
[01:30:26] Operations, monitoring, Patch-For-Review: restbase: skip icinga monitoring if on "dev" machines - https://phabricator.wikimedia.org/T189050#4029186 (Dzahn) on einsteinium after running puppet on restbase-dev1006: ``` define service { -# --PUPPET_NAME-- restbase-dev1006 restbase_http_root - active_...
[01:37:10] PROBLEM - cassandra-a service on restbase-dev1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[02:07:59] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir
[02:09:26] no good.. looking
[02:14:39] PROBLEM - Check systemd state on helium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:17:34] !log helium - bacula director process failed (Bacula interrupted by signal 11: Segmentation violation), icinga alerted. attempted to restart it. then: bacula-dir - the configtest failed!
[02:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:26] !log helium - running bacula-dir with -f in foreground revealed: ERROR TERMINATION at parse_conf.c:485 - Config error: Could not find config Resource mysql-srv-backups - line 7, col 33 of file /etc/bacula/jobs.d/bohrium.eqiad.wmnet-mysql-predump-piwik-Weekly-Wed-production.conf
[02:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:21] (PS1) Dzahn: cassandra/icinga: make monitoring configurable, skip on dev [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050)
[02:21:59] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[bacula-director]
[02:22:04] (CR) jerkins-bot: [V: -1] cassandra/icinga: make monitoring configurable, skip on dev [puppet] - https://gerrit.wikimedia.org/r/419339 (https://phabricator.wikimedia.org/T189050) (owner: Dzahn)
[02:29:09] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir
[02:29:49] RECOVERY - Check systemd state on helium is OK: OK - running: The system is fully operational
[02:31:59] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:32:49] PROBLEM - Check systemd state on helium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:33:09] PROBLEM - bacula director process on helium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (bacula), command name bacula-dir
[02:44:04] !log deleted 46 archived files
[02:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:00] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.24) (duration: 05m 40s)
[02:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:54:00] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[bacula-director]
[03:02:39] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[03:03:27] (PS1) Dzahn: bacula: restore fileset mysql-srv-backups to fix director [puppet] - https://gerrit.wikimedia.org/r/419341
[03:05:17] (PS2) Dzahn: bacula: restore fileset mysql-srv-backups to fix director [puppet] - https://gerrit.wikimedia.org/r/419341
[03:05:39] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:06:35] (CR) Dzahn: [C: 2] "to unbreak Bacula backups" [puppet] - https://gerrit.wikimedia.org/r/419341 (owner: Dzahn)
[03:10:21] (PS1) Dzahn: bacula: remove trailing slash for /srv/backups file set [puppet] - https://gerrit.wikimedia.org/r/419342
[03:10:59] (CR) Dzahn: [C: 2] bacula: remove trailing slash for /srv/backups file set [puppet] - https://gerrit.wikimedia.org/r/419342 (owner: Dzahn)
[03:11:59] RECOVERY - Check systemd state on helium is OK: OK - running: The system is fully operational
[03:12:19] RECOVERY - bacula director process on helium is OK: PROCS OK: 1 process with UID = 110 (bacula), command name bacula-dir
[03:13:53] !log bacula is working again - restored missing file set (https://gerrit.wikimedia.org/r/419341 )
[03:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:59] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:14:34] (CR) Dzahn: [C: 2] "plus https://gerrit.wikimedia.org/r/#/c/419342/" [puppet] - https://gerrit.wikimedia.org/r/419341 (owner: Dzahn)
[03:17:33] (CR) Dzahn: "today the bacula-director on helium crashed and when i wanted to restart it i got a config error because the fileset "mysql-srv-backup" wa" [puppet] - https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: Jcrespo)
[03:32:49] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[03:35:49] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:38:42] (CR) BryanDavis: wiki replicas: script index creation for easier maintenance (2 comments) [puppet] - https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[03:39:38] (PS1) Brian Wolff: Remove Masaryk University from public mirrors list [puppet] - https://gerrit.wikimedia.org/r/419344
[04:03:08] Operations, Traffic, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4048652 (BBlack) See also the peering information tracked in T186835
[04:12:20] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:13:19] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 75121 bytes in 0.357 second response time
[05:00:43] (PS1) KartikMistry: lttoolbox: Update to latest upstream release [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/419346 (https://phabricator.wikimedia.org/T189075)
[05:29:31] (PS1) KartikMistry: apertium: New upstream release [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075)
[05:30:02] (CR) jerkins-bot: [V: -1] apertium: New upstream release [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) (owner: KartikMistry)
[05:32:59] (PS2) KartikMistry: apertium: New upstream release [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075)
[05:33:27] (CR) jerkins-bot: [V: -1] apertium: New upstream release [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) (owner: KartikMistry)
[06:03:03] Operations, Data-Services, Datasets-General-or-Unknown: Allow connections from labstore1006&7 to the analytics vlan - https://phabricator.wikimedia.org/T189644#4048741 (madhuvishy)
[06:23:32] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419352
[06:23:36] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419352
[06:26:36] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419352 (owner: Marostegui)
[06:27:51] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419352 (owner: Marostegui)
[06:30:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 after alter table (duration: 01m 15s)
[06:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:18] (PS1) Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/419354 (https://phabricator.wikimedia.org/T187089)
[06:34:55] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/419354 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:36:29] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:36:31] (Merged) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/419354 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:37:19] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74816 bytes in 0.350 second response time
[06:39:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 for alter table (duration: 01m 14s)
[06:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:22] !log Deploy schema change on db1064 with replication (this will generate lag on s4 on labs hosts) - T187089 T185128 T153182
[06:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:29] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:45:29] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:45:29] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:52:18] Operations, Cloud-Services, DBA, Patch-For-Review: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4048789 (Marostegui) Open>Resolved I am going to consider this resolved for now, as it hasn't happened again. Thanks everyone involved...
[06:53:50] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:54:07] (PS1) Marostegui: db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419355
[06:56:41] (CR) Marostegui: [C: 2] db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419355 (owner: Marostegui)
[06:57:54] (Merged) jenkins-bot: db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419355 (owner: Marostegui)
[06:59:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool es1013 for kernel and mariadb upgrade (duration: 01m 14s)
[06:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:57] !log Stop mariadb on es1013 for mariadb and kernel upgrade
[07:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:53] (PS1) KartikMistry: apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075)
[07:08:23] (CR) jerkins-bot: [V: -1] apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: KartikMistry)
[07:17:26] (PS1) Marostegui: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419357
[07:18:50] (CR) Marostegui: [C: 2] db-eqiad.php: Slowly repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419357 (owner: Marostegui)
[07:20:01] (Merged) jenkins-bot: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419357 (owner: Marostegui)
[07:21:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool es1013 after kernel and mariadb upgrade (duration: 01m 14s)
[07:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:40] (PS1) Giuseppe Lavagetto: etcd: add class for v3 basic installation [puppet] - https://gerrit.wikimedia.org/r/419358 (https://phabricator.wikimedia.org/T166081)
[07:32:02] (PS1) Marostegui: db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/419359
[07:34:10] !log Reboot es2002 for kernel upgrade
[07:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:57] Operations, Thumbor, Patch-For-Review, Performance-Team (Radar), User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#4048834 (Gilles)
[07:42:36] Operations, Commons, Thumbor, media-storage, Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628#4048851 (Gilles)
[07:43:04] Operations, Thumbor, Patch-For-Review, Performance-Team (Radar), User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#4048853 (Gilles)
[07:43:09] Operations, Commons, Thumbor, media-storage, Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628#3437497 (Gilles)
[07:44:31] Operations, Commons, Thumbor, media-storage, Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628#3437497 (Gilles)
[07:45:24] !log Reboot es2003 for kernel upgrade
[07:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:35] !log Reboot es2004 for kernel upgrade
[07:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:34] Operations, Thumbor, Patch-For-Review, Performance-Team (Radar), User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#4048871 (Gilles)
[08:00:22] Hi. I would like that I have issues again with loading of some pages (same issue as the days before)
[08:00:26] report
[08:02:10] Hi Wiki13, thanks for the ping!
[08:02:54] just to confirm, you are browsing pages from EU right?
[08:03:00] yep
[08:03:11] esams caching again :p
[08:03:46] did you get 503s or general slowness ?
[08:04:52] I would say general slowness
[08:05:17] sometimes some content literally takes ages to load if at all
[08:05:46] seems to happen a lot with userscripts
[08:11:10] (PS2) Marostegui: db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/419359
[08:11:31] Wiki13: we are trying to check what's wrong and alerting the traffic team :)
[08:11:45] Operations, Thumbor, Patch-For-Review, Performance-Team (Radar), User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#4048913 (MoritzMuehlenhoff) >>! In T170817#3527971, @Gilles wrote: > OK, so if I'm following that means people are now advised to us...
[08:11:54] thats no problem :)
[08:16:18] (CR) jenkins-bot: Correct logo for the Livvi-Karelian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419213 (https://phabricator.wikimedia.org/T146745) (owner: Odder)
[08:17:01] (CR) jenkins-bot: Update logo for the Maithili Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419246 (https://phabricator.wikimedia.org/T149790) (owner: Odder)
[08:18:56] (CR) jenkins-bot: Add high-density logos for seven Wikipedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/419162 (https://phabricator.wikimedia.org/T150618) (owner: Odder)
[08:21:22] !log Restarting the CI Jenkins
[08:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:03] !log cp3040: restart varnish-be
[08:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:37] hmm
[08:31:58] Wiki13: the Traffic team is working on it, there are some Varnish backends in esams not behaving, but it seems that we are not in trouble for the moment
[08:32:18] yea thats what I noticed as well
[08:32:32] I believe cp3040 had issues before
[08:33:11] T189085 seems to confirm my suspicion
[08:33:11] T189085: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085
[08:50:41] (PS1) Marostegui: db-eqiad.php: Fully repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419364
[08:50:52] (CR) ArielGlenn: [C: 2] Remove Masaryk University from public mirrors list [puppet] - https://gerrit.wikimedia.org/r/419344 (owner: Brian Wolff)
[08:52:26] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419364 (owner: Marostegui)
[08:52:49] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:53:37] (Merged) jenkins-bot: db-eqiad.php: Fully repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419364 (owner: Marostegui)
[08:53:49] RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational
[08:55:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool es1013 after kernel and mariadb upgrade (duration: 01m 15s)
[08:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:42] !log cp3041: restart varnish-be
[08:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:39] (PS1) Rduran: Add requirements.txt with pymysql and tabulate [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/419365
[09:02:41] (PS1) Rduran: [WIP] Add port of osc_host.sh [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/419366
[09:04:16] Operations, Ops-Access-Requests: Access to analytics-privatedata-users for ayounsi - https://phabricator.wikimedia.org/T189650#4048979 (elukey) p:Triage>Normal
[09:05:24] (PS1) DCausse: Add cirrussearch settings for wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717)
[09:06:32] (CR) Jcrespo: [C: 1] "This seems ok- do it as fast as possible (repool even if cache is not full). If you plan to do a full rolling restart, bonus if you can mo" [mediawiki-config] - https://gerrit.wikimedia.org/r/419359 (owner: Marostegui)
[09:06:34] (CR) jerkins-bot: [V: -1] Add cirrussearch settings for wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: DCausse)
[09:06:36] (PS1) Marostegui: es2012: Update socket location [puppet] - https://gerrit.wikimedia.org/r/419368
[09:06:42] (PS1) Elukey: Grant access to analytics-privatedata-users to ayounsi [puppet] - https://gerrit.wikimedia.org/r/419369 (https://phabricator.wikimedia.org/T189650)
[09:08:35] !log Stop mysql on es2012 to upgrade socket path
[09:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:55] (CR) Marostegui: [C: 2] es2012: Update socket location [puppet] - https://gerrit.wikimedia.org/r/419368 (owner: Marostegui)
[09:10:16] (CR) Ayounsi: [C: 1] Grant access to analytics-privatedata-users to ayounsi [puppet] - https://gerrit.wikimedia.org/r/419369 (https://phabricator.wikimedia.org/T189650) (owner: Elukey)
[09:10:39] (CR) Elukey: [C: 2] Grant access to analytics-privatedata-users to ayounsi [puppet] - https://gerrit.wikimedia.org/r/419369 (https://phabricator.wikimedia.org/T189650) (owner: Elukey)
[09:10:45] (PS2) Elukey: Grant access to analytics-privatedata-users to ayounsi [puppet] - https://gerrit.wikimedia.org/r/419369 (https://phabricator.wikimedia.org/T189650)
[09:10:52] puppet-sniped by marostegui
[09:10:57] xddddd
[09:12:29] (PS1) Marostegui: es2013.yaml: Update socket path [puppet] - https://gerrit.wikimedia.org/r/419370
[09:12:41] (PS2) Marostegui: es2013.yaml: Update socket path [puppet] - https://gerrit.wikimedia.org/r/419370
[09:13:06] (CR) Jcrespo: "This was my fault, when I changed the mysql backup system, I checked if it was in use somewhere else, and couldn't find it with a grep on " [puppet] - https://gerrit.wikimedia.org/r/419341 (owner: Dzahn)
[09:13:39] (CR) Marostegui: [C: 2] es2013.yaml: Update socket path [puppet] - https://gerrit.wikimedia.org/r/419370 (owner: Marostegui)
[09:13:58] !log Stop mysql on es2013 to upgrade socket path
[09:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:07] Operations, Ops-Access-Requests, Patch-For-Review: Access to analytics-privatedata-users for ayounsi - https://phabricator.wikimedia.org/T189650#4049028 (elukey) Open>Resolved ``` elukey@analytics1001:~$ id ayounsi uid=16756(ayounsi) gid=500(wikidev) groups=500(wikidev),4(adm),700(ops),731(an...
[09:18:10] (PS1) Marostegui: es2014.yaml: Update socket location [puppet] - https://gerrit.wikimedia.org/r/419371
[09:18:41] (CR) Jcrespo: "Looking good (I haven't checked it in depth yet)" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/419366 (owner: Rduran)
[09:20:17] (CR) Marostegui: [C: 2] db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/419359 (owner: Marostegui)
[09:21:31] (Merged) jenkins-bot: db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/419359 (owner: Marostegui)
[09:23:00] !log Stop mariadb on pc2004 for kernel upgrade
[09:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:08] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool pc2004 for kernel and mariadb upgrade (duration: 01m 14s)
[09:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:53] (PS1) Marostegui: Revert "db-codfw.php: Depool pc2004" [mediawiki-config] - https://gerrit.wikimedia.org/r/419372
[09:33:13] (CR) Marostegui: [C: 2] "> This seems ok- do it as fast as possible (repool even if cache is" [mediawiki-config] - https://gerrit.wikimedia.org/r/419359 (owner: Marostegui)
[09:33:29] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool pc2004" [mediawiki-config] - https://gerrit.wikimedia.org/r/419372 (owner: Marostegui)
[09:33:58] (CR) Jcrespo: [C: -1] "Diff tables has been partially overriden by check_tables.py on wmfmariadbpy repository." [software] - https://gerrit.wikimedia.org/r/256231 (https://phabricator.wikimedia.org/T104459) (owner: Jcrespo)
[09:34:31] (CR) Marostegui: [C: 2] es2014.yaml: Update socket location [puppet] - https://gerrit.wikimedia.org/r/419371 (owner: Marostegui)
[09:34:47] !log Stop mysql on es2014 to upgrade socket path
[09:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:32] (Merged) jenkins-bot: Revert "db-codfw.php: Depool pc2004" [mediawiki-config] - https://gerrit.wikimedia.org/r/419372 (owner: Marostegui)
[09:37:04] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool pc2004 after kernel and mariadb upgrade (duration: 01m 14s)
[09:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:11] (PS1) Marostegui: es2015.yaml: Update socket location [puppet] - https://gerrit.wikimedia.org/r/419374
[09:39:59] (CR) Marostegui: [C: 2] es2015.yaml: Update socket location [puppet] - https://gerrit.wikimedia.org/r/419374 (owner: Marostegui)
[09:40:03] !log Stop mysql on es2015 to upgrade socket path
[09:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:24] (CR) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/419354 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[09:43:46] (CR) jenkins-bot: db-codfw.php: Depool pc2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/419359 (owner: Marostegui)
[09:44:44] !log installing samba security update (just the client side libraries)
[09:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:52] (PS1) Marostegui: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/419376
[09:45:42] Operations, DBA: Switchover m1 master to a newer host - https://phabricator.wikimedia.org/T189655#4049075 (jcrespo) p:Triage>Normal
[09:46:22] Operations, DBA: Switchover m1 master to a newer host - https://phabricator.wikimedia.org/T189655#4049089 (jcrespo) a:Marostegui>None
[09:47:06] (CR) Marostegui: [C: 2] db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/419376 (owner: Marostegui)
[09:47:15] (PS1) Elukey: profile::hadoop::monitoring: add require for hadoop common config [puppet] - https://gerrit.wikimedia.org/r/419377 (https://phabricator.wikimedia.org/T188294)
[09:48:17] (Merged) jenkins-bot: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/419376 (owner: Marostegui)
[09:49:01] Operations, DBA: Switchover m2 master to a newer host - https://phabricator.wikimedia.org/T189656#4049095 (jcrespo) p:Triage>Normal
[09:50:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool pc2005 for kernel and mariadb upgrade (duration: 01m 14s)
[09:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:08] !log Stop mariadb on pc2005 for kernel and mariadb upgrade + change socket location
[09:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:55] (CR) jenkins-bot: Revert "db-codfw.php: Depool pc2004" [mediawiki-config] - https://gerrit.wikimedia.org/r/419372 (owner: Marostegui)
[09:53:08] (PS1) Marostegui: Revert "db-codfw.php: Depool pc2005" [mediawiki-config] - https://gerrit.wikimedia.org/r/419378
[09:53:20] (CR) jenkins-bot: db-eqiad.php: Slowly repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419357 (owner: Marostegui)
[09:53:36] (CR) jenkins-bot: Enable VirtualPageViews on Hungarian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419271 (https://phabricator.wikimedia.org/T184793) (owner: Jdlrobson)
[09:58:39] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool pc2005" [mediawiki-config] - https://gerrit.wikimedia.org/r/419378 (owner: Marostegui)
[10:00:39] (Merged) jenkins-bot: Revert "db-codfw.php: Depool pc2005" [mediawiki-config] - https://gerrit.wikimedia.org/r/419378 (owner: Marostegui)
[10:01:57] (CR) Elukey: [C: 2] profile::hadoop::monitoring: add require for hadoop common config [puppet] - https://gerrit.wikimedia.org/r/419377 (https://phabricator.wikimedia.org/T188294) (owner: Elukey)
[10:03:04] (PS1) Muehlenhoff: Remove further package leftovers after jessie->stretch upgrades [puppet] - https://gerrit.wikimedia.org/r/419379
[10:03:20] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool pc2005 after kernel and mariadb upgrade (duration: 01m 15s)
[10:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:08] (CR) Ema: [C: 1] Temporarily remove hydrogen from LVS name servers in eqiad [puppet] - https://gerrit.wikimedia.org/r/419252 (owner: Muehlenhoff)
[10:04:57] (PS1) Marostegui: db-codfw.php: Depool pc2006 [mediawiki-config] - https://gerrit.wikimedia.org/r/419380
[10:06:31] (CR) Marostegui: [C: 2] db-codfw.php: Depool pc2006 [mediawiki-config] - https://gerrit.wikimedia.org/r/419380 (owner: Marostegui)
[10:07:01] !log archiving and dropping testblog from m2
[10:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:21] (CR) jenkins-bot: Add a localised logo for the Cree Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/419332 (owner: Odder)
[10:07:38] (Merged) jenkins-bot: db-codfw.php: Depool pc2006 [mediawiki-config] - https://gerrit.wikimedia.org/r/419380 (owner: Marostegui)
[10:07:56] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/419352 (owner: Marostegui)
[10:08:05] Operations, Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4049144 (Vgutierrez) p:Triage>Normal a:Vgutierrez
[10:09:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool pc2006 for kernel and mariadb upgrade (duration: 01m 14s)
[10:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:25] (CR) jenkins-bot: db-eqiad.php: Fully repool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419364 (owner: Marostegui)
[10:09:40] (CR) jenkins-bot: db-codfw.php: Depool pc2005 [mediawiki-config] - https://gerrit.wikimedia.org/r/419376 (owner: Marostegui)
[10:09:54] (CR) jenkins-bot: db-eqiad.php: Depool es1013 [mediawiki-config] - https://gerrit.wikimedia.org/r/419355 (owner: Marostegui)
[10:10:11] !log Stop mariadb on pc2006 for kernel and mariadb upgrade + change socket location
[10:10:14] (CR) jenkins-bot: Revert "db-codfw.php: Depool pc2005" [mediawiki-config] - https://gerrit.wikimedia.org/r/419378 (owner: Marostegui)
[10:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:00] (CR) jenkins-bot: db-codfw.php: Depool pc2006 [mediawiki-config] - https://gerrit.wikimedia.org/r/419380 (owner: Marostegui)
[10:12:35] (PS1) Marostegui: Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - https://gerrit.wikimedia.org/r/419381
[10:12:58] (PS2) Muehlenhoff: Temporarily remove hydrogen from LVS name servers in eqiad [puppet] - https://gerrit.wikimedia.org/r/419252
[10:13:53] (CR) Muehlenhoff: [C: 2] Temporarily remove hydrogen from LVS name servers in eqiad [puppet] - https://gerrit.wikimedia.org/r/419252 (owner: Muehlenhoff)
[10:16:33] (PS1) Elukey: Assign role::analytics_cluster::hadoop::worker to analytics1076 [puppet] - https://gerrit.wikimedia.org/r/419382 (https://phabricator.wikimedia.org/T188294)
[10:16:39] !log archiving and dropping bugzilla_testing from m2
[10:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:14] (CR) Elukey: [C: 2] Assign role::analytics_cluster::hadoop::worker to analytics1076 [puppet] - https://gerrit.wikimedia.org/r/419382 (https://phabricator.wikimedia.org/T188294) (owner: Elukey)
[10:19:22] (PS1) Ayounsi: Wheels for Netbox 2.3.1 [wheels/netbox] - https://gerrit.wikimedia.org/r/419383
[10:20:21] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - https://gerrit.wikimedia.org/r/419381 (owner: Marostegui)
[10:21:32] (Merged) jenkins-bot: Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - https://gerrit.wikimedia.org/r/419381 (owner: Marostegui)
[10:21:48] (PS2) Ayounsi: Wheels for Netbox 2.3.1 [wheels/netbox] - https://gerrit.wikimedia.org/r/419383
[10:22:34] !log dropping testotrs from m2
[10:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:00] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool pc2006 after kernel and mariadb upgrade (duration: 01m 14s)
[10:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:18] (PS3) Ayounsi: Wheels for Netbox 2.3.1 [wheels/netbox] - https://gerrit.wikimedia.org/r/419383
[10:25:14] Operations, DBA: Switchover m2 master to a newer host - https://phabricator.wikimedia.org/T189656#4049207 (jcrespo)
[10:26:22] Operations, DBA: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4049095 (jcrespo)
[10:28:59] (PS4) Ayounsi: Wheels for Netbox 2.3.1 [wheels/netbox] - https://gerrit.wikimedia.org/r/419383
[10:29:53] Operations, DBA: Switchover m1 master from db1016 to db1063 - https://phabricator.wikimedia.org/T189655#4049215 (jcrespo)
[10:30:05] (PS1) Marostegui: db-eqiad.php: Depool es1018
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/419385 [10:31:05] (03PS1) 10Marostegui: es1018.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/419386 [10:31:27] (03PS2) 10DCausse: Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) [10:31:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419385 (owner: 10Marostegui) [10:33:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419385 (owner: 10Marostegui) [10:35:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool es1018 for kernel and mariadb upgrade (duration: 01m 14s) [10:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:37] (03PS1) 10Vgutierrez: admin: Grant bmansurov access to terbium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/419387 (https://phabricator.wikimedia.org/T189285) [10:35:44] !log rebooting hydrogen for kernel security update [10:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:18] (03CR) 10Vgutierrez: [C: 04-1] "pending ops meeting approval next Monday" [puppet] - 10https://gerrit.wikimedia.org/r/419387 (https://phabricator.wikimedia.org/T189285) (owner: 10Vgutierrez) [10:37:12] !log Stop mariadb on es1018 for kernel and mariadb upgrade + change socket location [10:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:17] (03CR) 10Muehlenhoff: [C: 031] "Looks good (but needs meeting approval)" [puppet] - 10https://gerrit.wikimedia.org/r/419387 (https://phabricator.wikimedia.org/T189285) (owner: 10Vgutierrez) [10:37:52] (03CR) 10Marostegui: [C: 032] es1018.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/419386 (owner: 10Marostegui) [10:43:43] 10Operations, 10Ops-Access-Requests, 
10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4049258 (10Vgutierrez) 05Open>03stalled [10:43:53] (03PS1) 10Muehlenhoff: Revert "Temporarily remove hydrogen from LVS name servers in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/419388 [10:44:05] 10Operations, 10Ops-Access-Requests, 10Ops-Access-Reviews, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4037542 (10Vgutierrez) The task is now just pending of Monday Ops meeting approval. [10:45:47] (03CR) 10Muehlenhoff: [C: 032] Revert "Temporarily remove hydrogen from LVS name servers in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/419388 (owner: 10Muehlenhoff) [10:50:06] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419389 [10:51:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419389 (owner: 10Marostegui) [10:51:41] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1961 bytes in 0.080 second response time [10:52:37] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419389 (owner: 10Marostegui) [10:52:59] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool pc2006" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419381 (owner: 10Marostegui) [10:53:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419385 (owner: 10Marostegui) [10:53:09] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419389 (owner: 10Marostegui) [10:54:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool es1018 after kernel and mariadb upgrade
(duration: 01m 14s) [10:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:47] (03PS1) 10ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) [10:58:43] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, and 2 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4049283 (10Vgutierrez) Please @DYNKM check @MoritzMuehlenhoff comment on https://gerrit.wikimedia.org/r/c/416... [11:02:25] !log rebooting einsteinium / icinga.wikimedia.org for kernel security update [11:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:32] (03PS1) 10Jcrespo: dbproxy: Reenable firewall on proxies for m1 and m2 with holes [puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) [11:03:20] (03CR) 10jerkins-bot: [V: 04-1] dbproxy: Reenable firewall on proxies for m1 and m2 with holes [puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [11:04:28] (03PS2) 10Jcrespo: dbproxy: Reenable firewall on proxies for m1 and m2 with holes [puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) [11:04:59] (03CR) 10jerkins-bot: [V: 04-1] dbproxy: Reenable firewall on proxies for m1 and m2 with holes [puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [11:06:00] icinga is back up [11:06:45] (03PS3) 10Jcrespo: dbproxy: Reenable firewall on proxies for m1 and m2 with holes [puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) [11:11:03] (03CR) 10Marostegui: [C: 031] "if those are the ips found...the code looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [11:11:38] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.091 second response time [11:12:48] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4049311 (10Vgutierrez) @DarTar this also affects users @Daniela.paolotti and ciro, right? [11:17:20] (03PS2) 10Rduran: [WIP] Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419366 [11:17:28] (03CR) 10Elukey: [C: 032] "commentary: aside from forgetting about the resource manager's monitoring profile I tested this and it doesn't work as expected. The packa" [puppet] - 10https://gerrit.wikimedia.org/r/419377 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [11:21:46] (03PS1) 10Arturo Borrero Gonzalez: puppetmaster: base_repo: ensure owner of repo tree [puppet] - 10https://gerrit.wikimedia.org/r/419395 (https://phabricator.wikimedia.org/T184259) [11:22:28] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - kubelet_operational_latencies is 20723 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:24:28] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - kubelet_operational_latencies is 1005 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:29:31] (03CR) 10Arturo Borrero Gonzalez: [C: 04-2] "Something is very wrong with this change:" [puppet] - 10https://gerrit.wikimedia.org/r/419395 (https://phabricator.wikimedia.org/T184259) (owner: 10Arturo Borrero Gonzalez) [11:32:03] (03PS1) 10Marostegui: db-eqiad.php: Restore original weight for es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419398 [11:32:11] (03PS2) 10Arturo Borrero Gonzalez: puppetmaster: 
base_repo: ensure owner of repo tree [puppet] - 10https://gerrit.wikimedia.org/r/419395 (https://phabricator.wikimedia.org/T184259) [11:33:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original weight for es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419398 (owner: 10Marostegui) [11:34:34] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original weight for es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419398 (owner: 10Marostegui) [11:34:44] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "1 - you should use gitowner and gitgroup" [puppet] - 10https://gerrit.wikimedia.org/r/419395 (https://phabricator.wikimedia.org/T184259) (owner: 10Arturo Borrero Gonzalez) [11:36:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for es1018 after kernel and mariadb upgrade (duration: 01m 15s) [11:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:34] (03PS1) 10Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419399 [11:39:03] (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight for es1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419398 (owner: 10Marostegui) [11:39:38] (03PS2) 10Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419399 [11:40:11] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for nagios-nrpe-server [puppet] - 10https://gerrit.wikimedia.org/r/419400 (https://phabricator.wikimedia.org/T135991) [11:41:02] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for nagios-nrpe-server [puppet] - 10https://gerrit.wikimedia.org/r/419400 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:41:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419399 (owner: 10Marostegui) [11:42:34] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/419399 (owner: 10Marostegui) [11:44:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 for data checks and kernel upgrade (duration: 01m 14s) [11:44:11] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419399 (owner: 10Marostegui) [11:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:01] !log Stop db1074 for kernel upgrade [11:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:44] (03PS3) 10Arturo Borrero Gonzalez: puppetmaster: base_repo: ensure owner of repo tree [puppet] - 10https://gerrit.wikimedia.org/r/419395 (https://phabricator.wikimedia.org/T184259) [11:51:20] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: base_repo: ensure owner of repo tree [puppet] - 10https://gerrit.wikimedia.org/r/419395 (https://phabricator.wikimedia.org/T184259) (owner: 10Arturo Borrero Gonzalez) [11:53:10] (03CR) 10Filippo Giunchedi: "LGTM, minor comment on commit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/419400 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:55:27] (03CR) 10Filippo Giunchedi: [C: 031] Remove further package leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/419379 (owner: 10Muehlenhoff) [11:55:58] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for nagios-nrpe-server [puppet] - 10https://gerrit.wikimedia.org/r/419400 (https://phabricator.wikimedia.org/T135991) [11:57:01] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for nagios-nrpe-server [puppet] - 10https://gerrit.wikimedia.org/r/419400 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:01:55] (03CR) 10Alexandros Kosiaris: "Indeed I missed that as well. Sorry about that." 
[puppet] - 10https://gerrit.wikimedia.org/r/419341 (owner: 10Dzahn) [12:03:12] (03PS4) 10Arturo Borrero Gonzalez: puppetmaster: base_repo: ensure owner of repo tree [puppet] - 10https://gerrit.wikimedia.org/r/419395 (https://phabricator.wikimedia.org/T184259) [12:03:28] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, and 2 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4049415 (10DYNKM) Resumed sounds ideal; thank you! [12:08:12] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419404 [12:09:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419404 (owner: 10Marostegui) [12:11:03] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419404 (owner: 10Marostegui) [12:11:16] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419404 (owner: 10Marostegui) [12:14:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1074 (duration: 01m 14s) [12:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:59] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4049434 (10Pcoombe) The current plan is to move the thank you pages and some other fundraising specific pages... 
[12:19:03] !log kartik@tin Started deploy [cxserver/deploy@c204d9c]: Update cxserver to c355d0c [12:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:15] !log kartik@tin Finished deploy [cxserver/deploy@c204d9c]: Update cxserver to c355d0c (duration: 03m 12s) [12:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:20] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419408 [12:36:22] (03PS1) 10Ema: prometheus: aggregation rule for varnish_backend_conn [puppet] - 10https://gerrit.wikimedia.org/r/419409 (https://phabricator.wikimedia.org/T181315) [12:36:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419408 (owner: 10Marostegui) [12:37:13] Reedy: I'm kinda sad he called Persian Wikipedia "minor" it has 600K articles [12:37:50] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419408 (owner: 10Marostegui) [12:38:58] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419408 (owner: 10Marostegui) [12:39:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1074 (duration: 01m 15s) [12:39:22] (03CR) 10Alexandros Kosiaris: "I have no real preference on this. Whatever works. 
I am guessing that means if the time to calculate all mbeans is smaller than the time b" [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: 10Filippo Giunchedi) [12:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:09] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: aggregation rule for varnish_backend_conn [puppet] - 10https://gerrit.wikimedia.org/r/419409 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema) [12:40:21] Amir1: Blame David Gerrard :) [12:40:30] :D [12:40:49] (03PS2) 10Filippo Giunchedi: puppetmaster: export all puppetdb mbeans [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) [12:41:11] (03PS1) 10Hashar: Fix nrpe spec for os_version() [puppet] - 10https://gerrit.wikimedia.org/r/419410 [12:41:59] (03CR) 10jerkins-bot: [V: 04-1] Fix nrpe spec for os_version() [puppet] - 10https://gerrit.wikimedia.org/r/419410 (owner: 10Hashar) [12:43:47] (03CR) 10Filippo Giunchedi: [C: 032] puppetmaster: export all puppetdb mbeans [puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: 10Filippo Giunchedi) [12:45:55] (03PS2) 10Ema: prometheus: aggregation rule for varnish_backend_conn [puppet] - 10https://gerrit.wikimedia.org/r/419409 (https://phabricator.wikimedia.org/T181315) [12:46:13] (03CR) 10Ema: [V: 032 C: 032] prometheus: aggregation rule for varnish_backend_conn [puppet] - 10https://gerrit.wikimedia.org/r/419409 (https://phabricator.wikimedia.org/T181315) (owner: 10Ema) [12:47:12] (03PS2) 10Muehlenhoff: Remove further package leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/419379 [12:48:51] (03CR) 10Filippo Giunchedi: "FTR it is taking ~6s for results to return on nihal and nitrogen, I'll keep an eye on it in the next few days." 
[puppet] - 10https://gerrit.wikimedia.org/r/419158 (https://phabricator.wikimedia.org/T189516) (owner: 10Filippo Giunchedi) [12:48:54] (03CR) 10Muehlenhoff: [C: 032] Remove further package leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/419379 (owner: 10Muehlenhoff) [12:49:27] 10Operations, 10Puppet, 10Patch-For-Review: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#4049592 (10fgiunchedi) [12:49:31] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Update jmx_exporter mbeans whitelist for puppetdb 4 - https://phabricator.wikimedia.org/T189516#4049589 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi We're not whitelisting mbeans anymore, resolving. [12:49:45] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419413 [12:51:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419413 (owner: 10Marostegui) [12:52:11] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419413 (owner: 10Marostegui) [12:52:25] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419413 (owner: 10Marostegui) [12:52:50] (03CR) 10Zfilipin: [C: 031] Publish throttle-analyze at noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414758 (https://phabricator.wikimedia.org/T187894) (owner: 10Urbanecm) [12:53:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1074 (duration: 01m 14s) [12:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180314T1300). [13:00:04] Urbanecm and revi: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] I can SWAT today [13:00:17] zeljkof, I'm here [13:00:24] (03PS3) 10Rduran: [WIP] Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419366 [13:00:28] oh well... [13:00:38] (SWAT is a bit earlier, but it might be just because it isn't pinned to UTC) [13:00:41] I’ll need few minutes to sit in front of laptop [13:01:12] revi: do you want to go first? since you have one patch and it's probably late for you? [13:01:36] I’m commuting right now (yes, 10pm) [13:01:39] (03PS1) 10Rush: openstack: labtestn::neutron::metadata_proxy_shared_secret [labs/private] - 10https://gerrit.wikimedia.org/r/419416 [13:01:45] so just go with Urbanecm for now [13:01:54] revi: ok, let me know when you are ready [13:02:22] Urbanecm: there is nothing to test in the first patch, right? [13:02:39] zeljkof, yes [13:02:40] is there anything to test at mwdebug for the patches 2 and 3? 
[13:02:49] Both are testable [13:03:04] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417687 (https://phabricator.wikimedia.org/T189241) (owner: 10Urbanecm) [13:03:16] Urbanecm: ok, I'll let you know when the second patch is at mwdebug [13:03:19] zeljkof, ack [13:03:28] merging the first patch [13:04:08] aack [13:04:26] ready now (except I'm doing stuff in a subway lol) [13:04:46] (03Merged) 10jenkins-bot: Remove obsolete throttle rules, add one new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417687 (https://phabricator.wikimedia.org/T189241) (owner: 10Urbanecm) [13:04:51] I almost forgot that I rescheduled to today SWAT lol [13:04:55] revi: ok, I'll merge your patch next, so you can go [13:04:57] kk [13:05:43] revi, you have internet connection in subway? [13:05:48] yes [13:05:56] Wi-Fi via LTE [13:06:42] (03PS2) 10Revi: Disable upload for non-admins on kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) [13:06:54] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:417687|Remove obsolete throttle rules, add one new (T189241)]] (duration: 01m 15s) [13:06:55] I did the rebase for you :P [13:06:58] Urbanecm: 417687 is deployed, I'll deploy revi's patch next so he can go, it's 10pm there [13:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:00] T189241: IP cap lift for University of Texas at Arlington HerStory Edit-a-thon on 2018-03-28 - https://phabricator.wikimedia.org/T189241 [13:07:14] zeljkof, sure, no need to hurry [13:07:19] I'm just after lunch [13:07:23] timezone heh [13:07:52] well, DST'ed SWAT time is better since non-DST is 11PM (one bad thing: morning meeting, 1 hr earlier) [13:07:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) (owner: 10Revi) [13:08:38] revi, EU isn't the
only SWAT [13:08:46] US SWAT - even worse [13:08:47] (03Merged) 10jenkins-bot: Disable upload for non-admins on kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) (owner: 10Revi) [13:08:56] 2AM SWAT is no-no [13:09:17] revi: but there is US afternoon swat, that might be your morning, right? [13:09:26] 9AM before DST, yes [13:09:29] (03CR) 10jenkins-bot: Remove obsolete throttle rules, add one new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417687 (https://phabricator.wikimedia.org/T189241) (owner: 10Urbanecm) [13:09:30] that was good [13:09:33] Evening SWAT? [13:09:38] (23:00 UTC) [13:09:43] Evening swat would be the correct name yeah [13:09:52] that was 00:00 UTC for PST [13:10:03] but PDT - 23:00 [13:10:09] 9am to 8am, meh [13:10:16] revi: your patch is at mwdebug1002, please test and let me know if I can deploy [13:10:17] conclusion: Timezone sucks [13:10:40] (03CR) 10Ottomata: "Package['hadoop'] -> Class['profile::hadoop::monitoring::datanode'] ?" [puppet] - 10https://gerrit.wikimedia.org/r/419377 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [13:11:19] yes, good to go [13:11:22] zeljkof: ^ [13:11:28] revi: deploying [13:12:58] !log zfilipin@tin Synchronized dblists/commonsuploads.dblist: SWAT: [[gerrit:417189|Disable upload for non-admins on kowikiversity (T189021)]] (duration: 01m 14s) [13:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:04] T189021: Disable upload for non-admins on kowikiversity - https://phabricator.wikimedia.org/T189021 [13:13:12] (03PS2) 10Andrew Bogott: Rename newhorizon and newtoolsadmin to horizon and toolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/419225 (https://phabricator.wikimedia.org/T168470) [13:13:14] revi: deployed, please test and thanks for deploying with #releng ;) [13:13:31] confirmed, thanks, and have a great day! 
[13:13:34] Urbanecm: merging 414758 [13:13:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414758 (https://phabricator.wikimedia.org/T187894) (owner: 10Urbanecm) [13:14:01] (03CR) 10Andrew Bogott: [C: 032] Rename newhorizon and newtoolsadmin to horizon and toolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/419225 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [13:14:07] zeljkof, ack [13:15:31] (03PS2) 10Hashar: Fix nrpe spec for os_version() [puppet] - 10https://gerrit.wikimedia.org/r/419410 [13:15:49] (03Merged) 10jenkins-bot: Publish throttle-analyze at noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414758 (https://phabricator.wikimedia.org/T187894) (owner: 10Urbanecm) [13:16:36] (03PS2) 10Rush: openstack: labtestn::neutron::metadata_proxy_shared_secret [labs/private] - 10https://gerrit.wikimedia.org/r/419416 [13:16:38] Urbanecm: 414758 is at mwdebug1002 [13:16:44] (03CR) 10Rush: [V: 032 C: 032] openstack: labtestn::neutron::metadata_proxy_shared_secret [labs/private] - 10https://gerrit.wikimedia.org/r/419416 (owner: 10Rush) [13:17:00] zeljkof, ack [13:18:11] I'm not sure if mwdebug is working with noc.wikimedia.org [13:18:25] Can you please try to deploy it? 
[13:18:29] Urbanecm: sure [13:18:34] Urbanecm: deploying [13:20:03] !log ppchelko@tin Started deploy [cpjobqueue/deploy@5686f16]: Deduplicate based on the root job dt and sha1 combination [13:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:40] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@5686f16]: Deduplicate based on the root job dt and sha1 combination (duration: 00m 38s) [13:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:09] !log zfilipin@tin Synchronized docroot/noc/conf/throttle-analyze.php.txt: SWAT: [[gerrit:414758|Publish throttle-analyze at noc (T187894)]] (duration: 01m 13s) [13:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:15] T187894: Publish throttle-analyze.php to noc.wikimedia.org - https://phabricator.wikimedia.org/T187894 [13:21:19] !log ppchelko@tin Started deploy [cpjobqueue/deploy@c879056]: Deduplicate based on the root job dt and sha1 combination. Forgot to pull [13:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:32] Urbanecm: deployed 414758 [13:21:52] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@c879056]: Deduplicate based on the root job dt and sha1 combination. Forgot to pull (duration: 00m 33s) [13:21:53] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415580 (https://phabricator.wikimedia.org/T188456) (owner: 10Urbanecm) [13:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:27] zeljkof, well...probably I didn't use the right way, I don't know if asking for help and uploading a follow up is enough or reverting is better, you decide :) [13:22:55] Urbanecm: so, it does not work? 
[13:23:14] (414758) [13:23:15] (03Merged) 10jenkins-bot: Add ruwikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415580 (https://phabricator.wikimedia.org/T188456) (owner: 10Urbanecm) [13:23:46] zeljkof, no [13:24:13] Urbanecm: ok, I'll revert it after 415580, so you can work on it after swat [13:24:25] (03CR) 10jenkins-bot: Disable upload for non-admins on kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) (owner: 10Revi) [13:24:32] zeljkof, ok [13:25:23] Urbanecm: 415580 is at mwdebug1002 [13:25:24] (03PS2) 10Andrew Bogott: Move horizon and toolsadmin to labweb backends [puppet] - 10https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) [13:25:26] (03PS1) 10Andrew Bogott: labweb: move 'newhorizon' vhost to 'horizon' [puppet] - 10https://gerrit.wikimedia.org/r/419420 (https://phabricator.wikimedia.org/T168470) [13:25:27] zeljkof, ok [13:25:34] (03CR) 10jenkins-bot: Publish throttle-analyze at noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414758 (https://phabricator.wikimedia.org/T187894) (owner: 10Urbanecm) [13:26:12] ACKNOWLEDGEMENT - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Cannot make SSL connection. andrew bogott why alert when the host is in downtime? [13:26:12] ACKNOWLEDGEMENT - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Cannot make SSL connection. andrew bogott why alert when the host is in downtime? [13:26:30] (03CR) 10Andrew Bogott: [C: 032] labweb: move 'newhorizon' vhost to 'horizon' [puppet] - 10https://gerrit.wikimedia.org/r/419420 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [13:26:45] zeljkof, well... I don't think that "[Wqki@gpAAC4AAFaqABkAAAAR] 2018-03-14 13:26:18: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" [13:26:45] " is correct thing it should do [13:27:11] Urbanecm: :D [13:27:13] Urbanecm: revert? 
[13:27:37] zeljkof, probably, if we cannot find what's wrong. I don't have access to logstash so I have no clue what might be wrong [13:27:54] (03CR) 10jenkins-bot: Add ruwikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415580 (https://phabricator.wikimedia.org/T188456) (owner: 10Urbanecm) [13:28:12] rdbms sounds like timeout [13:28:21] no logstash access either so just guessing [13:28:24] Urbanecm: swat is probably not a good time for debugging, I will revert then both 414758 and 415580, ok? [13:28:39] Urbanecm, Hauskatze can i help you guys? [13:28:53] (03PS2) 10Rush: openstack: labtestmetal partmon raid1 recipe [puppet] - 10https://gerrit.wikimedia.org/r/419258 (https://phabricator.wikimedia.org/T188266) [13:29:00] vgutierrez: hi, Urbanecm is getting a fatal exception [13:29:17] vgutierrez, I'm wondering why https://ru.wikimedia.org/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Q/doc&action=edit is fatalerror [13:29:17] [Wqki@gpAAC4AAFaqABkAAAAR] 2018-03-14 13:26:18: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" [13:29:22] gracias [13:29:23] (when looking using mwdebug1002) [13:29:54] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for lldpd [puppet] - 10https://gerrit.wikimedia.org/r/419421 [13:30:28] related DB tables existed in that project DB? [13:30:41] rxy, that's the problem [13:31:00] zeljkof, can you create tables for wikidataclient please? [13:31:20] Urbanecm: um, you'll have to let me know how to do it? is there a script I can run? [13:31:31] zeljkof, probably, looking into docs [13:31:39] possible update.php ? [13:31:53] not sure for WMF wikis.. [13:32:09] Urbanecm: to make it explicit, 414758 (Publish throttle-analyze at noc) should be reverted, right? [13:32:29] any DBA arounds? [13:32:32] rxy, update.php is STRICTLY PROHIBITED for WMF wikis... 
[13:32:46] oh i don t know it [13:33:00] rxy, there's something in WikimediaMaintenance I think [13:33:10] zeljkof, yes, 414758 should be reverted [13:33:26] maybe jynus or marostegui are around (dbas) [13:33:33] (03PS1) 10Zfilipin: Revert "Publish throttle-analyze at noc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419422 [13:33:49] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419422 (owner: 10Zfilipin) [13:33:58] Per https://wikitech.wikimedia.org/wiki/Add_a_wiki, section Wikidata [13:34:30] That script is known to be troublesome, you might want to ask Marius (hoo) or Katie (aude) run it for you or just create a ticket (that may be done anytime after the wiki was created). [13:35:17] (03Merged) 10jenkins-bot: Revert "Publish throttle-analyze at noc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419422 (owner: 10Zfilipin) [13:35:19] Urbanecm: ok, so as it stands now, I would revert 415580 (Add ruwikimedia to wikidataclient) too, until we know how to properly deploy it, ok? [13:35:31] (03Abandoned) 10Rush: openstack: labtestmetal partmon raid1 recipe [puppet] - 10https://gerrit.wikimedia.org/r/419258 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:35:44] zeljkof, ok, I'll get commands that should be run [13:35:59] Urbanecm: great, thanks, reverting both then [13:36:17] zeljkof, ack, sorry for giving 2/3 patches "to-revert" :D [13:36:43] Urbanecm: no problem until it's like that _every_ time ;) [13:36:57] Ok :) [13:37:48] what happened? [13:38:10] jynus, how can we create tables for new wikidataclient wiki? 
[13:38:38] (03PS1) 10Rush: openstack: neutron initial components for labtestn [puppet] - 10https://gerrit.wikimedia.org/r/419424 (https://phabricator.wikimedia.org/T188266) [13:38:41] update.php is ok IF it only creates new tables [13:38:43] jynus, Urbanecm: I did not revert that one yet, if it's fixable we can still deploy it [13:38:59] BTW, as Urbanecm pointed out, "Error: 1146 Table 'ruwikimedia.wbc_entity_usage'"" [13:39:04] not altering existing things, or adding or removing indexes [13:39:12] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron initial components for labtestn [puppet] - 10https://gerrit.wikimedia.org/r/419424 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:39:19] jynus, I added new wiki to wikidataclient.dblist [13:39:23] Is that a new wiki? [13:39:23] hoo ^? [13:39:26] I guess you need to run client/sql/entity_usage.sql in the Wikibase repository? [13:39:27] no, existing one [13:39:32] (03CR) 10jenkins-bot: Revert "Publish throttle-analyze at noc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419422 (owner: 10Zfilipin) [13:39:46] Lucas_WMDE, I cannot run anything, I have no deploy access [13:39:47] then yes, a script to create those has to be run [13:40:07] jynus, WHICH script, can you provide exact syntax please? [13:40:07] normally it is documented on add a new wiki [13:40:11] Urbanecm: s/you/someone/ ;) [13:40:16] I have no idea, I do not handle that [13:40:21] can you tell me what is the outage? 
[13:40:34] https://ru.wikimedia.org/w/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%D0%BE%D0%BD:Q/doc&action=edit with mwdebug1002 [13:40:42] There's fatal exception [13:40:53] jynus: there is no outage yet, I have deployed this at mwdebug1002 https://gerrit.wikimedia.org/r/#/c/415580/ [13:40:57] ok [13:41:05] I was pinged, didn't know the context [13:41:10] as wikidata is everywhere [13:41:21] This is chapter wiki FYI [13:41:34] so, I am never involved on wiki creation [13:41:50] update.php is also not to be run on creation [13:42:10] jynus: so, should we revert it, until we figure out how to properly deploy it? [13:42:48] !log zfilipin@tin Synchronized docroot/noc/conf/: SWAT: [[gerrit:419422|Revert "Publish throttle-analyze at noc" (T187894)]] (duration: 01m 15s) [13:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:54] T187894: Publish throttle-analyze.php to noc.wikimedia.org - https://phabricator.wikimedia.org/T187894 [13:43:15] databases seem on a good state, so up to de deployers- if it is going to create fatals, and nobody knows why, sure [13:43:26] Urbanecm: reverted 414758 with 419422 and deployed it [13:43:27] (03PS2) 10Rush: openstack: neutron initial components for labtestn [puppet] - 10https://gerrit.wikimedia.org/r/419424 (https://phabricator.wikimedia.org/T188266) [13:44:03] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron initial components for labtestn [puppet] - 10https://gerrit.wikimedia.org/r/419424 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:44:12] jynus: it's deployed only at mwdebug1002 for now, so since both Urbanecm and me (deployer) are not sure what to do, I'll revert until Urbanecm figures out what needs to be done :) [13:44:25] Fine from my side [13:44:30] I agree with that [13:44:39] Urbanecm: ok, reverting and deploying [13:44:40] line of thought [13:44:59] ask someone in the know how to deploy a new extension [13:45:11] hoo or aude ^ [13:45:23] jynus: swat should be for quick 
and simple deploys, if anything is strange, I usually revert [13:45:27] +1 [13:45:44] (03PS1) 10Zfilipin: Revert "Add ruwikimedia to wikidataclient" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419425 [13:45:48] +1 [13:45:50] if this is a new wiki, that normally is handled in its own window [13:45:56] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419425 (owner: 10Zfilipin) [13:46:05] because I remember problems happen always [13:47:16] (03Merged) 10jenkins-bot: Revert "Add ruwikimedia to wikidataclient" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419425 (owner: 10Zfilipin) [13:47:45] (03PS1) 10Rush: openstack: labtestn initial neutron components [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) [13:48:05] If the error is "X missing", likely some script to be run is missing [13:48:18] (03CR) 10jerkins-bot: [V: 04-1] openstack: labtestn initial neutron components [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:49:04] !log zfilipin@tin Synchronized dblists/wikidataclient.dblist: SWAT: [[gerrit:419425|Revert "Add ruwikimedia to wikidataclient" (T188456)]] (duration: 01m 14s) [13:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:09] T188456: Need to use the Wikidata Q for the WMRU site (Wikibase Client) - https://phabricator.wikimedia.org/T188456 [13:49:12] Urbanecm: reverted 415580 with 419425 and deployed [13:49:17] ack [13:49:26] anything else for SWAT? 
(will wait a minute or two for replies) [13:49:41] (03Abandoned) 10Rush: openstack: neutron initial components for labtestn [puppet] - 10https://gerrit.wikimedia.org/r/419424 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:50:28] !log EU SWAT finished [13:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:52] so, if you see the message "did you run update.php?" on an error [13:51:01] do not run update.php :-) [13:51:39] it means something was missing or done out of order [13:52:05] (03PS2) 10Rush: openstack: labtestn initial neutron components [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) [13:52:17] jynus: thanks for the tip! :) [13:52:17] (03CR) 10jenkins-bot: Revert "Add ruwikimedia to wikidataclient" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419425 (owner: 10Zfilipin) [13:52:36] (03CR) 10jerkins-bot: [V: 04-1] openstack: labtestn initial neutron components [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:53:03] ^ zeljkof is that what was tried to be deployed? [13:53:18] (the revert) [13:53:53] (03CR) 10Ayounsi: [C: 031] "I don't know enough to comment on the Puppet part." [puppet] - 10https://gerrit.wikimedia.org/r/419421 (owner: 10Muehlenhoff) [13:53:54] jynus: yes [13:54:04] yeah, that indeed needs previous install [13:54:10] cannot just be added [13:54:56] normally it is done on addwiki, maybe some missed it? [13:55:07] jynus, the wiki was not wikidataclient originally [13:55:16] The task (T188456) is about make it WD client [13:55:16] T188456: Need to use the Wikidata Q for the WMRU site (Wikibase Client) - https://phabricator.wikimedia.org/T188456 [13:55:41] well, then there you have it [13:55:50] it needs install first [13:56:07] Wikidata doesn't add any new tables to the wiki itself [13:56:23] wikidata client does, doesn't it? 
[13:56:32] *wikibase [13:56:38] afaik yes, wbc_entity_usage [13:56:53] there is a process to it and I bet it is documented [13:57:04] Oh, hiding there [13:57:05] // most wikis are wikibase client wikis and no harm to adding this everywhere [13:57:05] $dbw->sourceFile( "$IP/extensions/Wikibase/client/sql/entity_usage.sql" ); [13:57:18] I am not arguing, just trying to point to what's missing [13:57:33] mwscript sql.php --wiki=ruwikimedia extensions/Wikibase/client/sql/entity_usage.sql [13:58:02] New wikis will have it... Older wikis that haven't had it enabled, won't have had it added [13:59:17] Is this the relevant bits https://wikitech.wikimedia.org/wiki/Add_a_wiki#Wikidata ? [13:59:32] That's just maintenance scripts [13:59:36] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/addWiki.php#L115 [13:59:47] addWiki.php does add that table when any new wiki is created [14:00:05] yes, I think they said it is not a new wiki [14:00:09] hence the issue [14:00:58] [13:57:33] mwscript sql.php --wiki=ruwikimedia extensions/Wikibase/client/sql/entity_usage.sql [14:01:01] That's enough to fix it [14:01:23] that seems ok to me [14:02:09] Maybe it's worth adding it to the rest of the wikis that don't have it to save this problem in future [14:02:15] But then, that's a lot of empty tables [14:02:20] I don't like that [14:02:46] when we have a proper table coordination, it will be checked (extensions -> tables) [14:02:56] Other extensions do add their tables to all new wikis, but they're not used on many [14:03:10] and sometimes it is better to catch problems faster [14:04:08] I prefer to get an error and improve the documentation [14:04:21] "how to add wikidata to an existing wiki" [14:05:30] We can just add it to createExtensionTables.php [14:05:32] Makes life easier [14:05:52] I don't even know what that is [14:05:52] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for lldpd [puppet] - 
10https://gerrit.wikimedia.org/r/419421 (owner: 10Muehlenhoff) [14:06:03] A maintenance script that does what the name says [14:06:16] !log created wbc_entity_usage on ruwikimedia T188456 [14:06:22] on extensions/WikimediaMaintenance jynus [14:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:24] T188456: Need to use the Wikidata Q for the WMRU site (Wikibase Client) - https://phabricator.wikimedia.org/T188456 [14:06:35] very handy to create extension tables before deploying an extension [14:06:40] or after [14:07:03] so if that checks what extensions are enabled and creates the appropriate tables automatically [14:07:04] mwscript createExtensionTables.php --wiki=$wiki 'extension' [14:07:21] that seems to me like a big yeah [14:07:33] let me fetch the code for you [14:07:48] or the same, if you have to manually tell the extension [14:08:07] https://gerrit.wikimedia.org/r/#/c/419433/ [14:08:09] https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/master/createExtensionTables.php [14:08:25] (03Abandoned) 10Ottomata: Remove force_protocol_version for cache webrequest varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/417308 (https://phabricator.wikimedia.org/T185136) (owner: 10Ottomata) [14:08:48] Reedy: I assume it does nothing if they are already there? [14:08:51] or it fails [14:09:17] Probably just fails [14:09:21] that is ok [14:09:26] unless the patch has CREATE TABLE IF NOT EXISTS (some do) [14:09:36] +1 [14:09:48] if you try to create a table that exists, I guess sql will fail as table already created? 
[14:09:53] yes [14:10:02] (03PS3) 10Ottomata: Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - 10https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297) [14:10:55] Reedy: although this would have failed anyway- so a reminder has to be set to run this just before enabling a new extension [14:10:58] seems that the violent wind has stopped, hopefully I won't get another power cut [14:11:11] (entity_usage.sql does have IF NOT EXISTS btw, but only on the CREATE TABLE and not on the indices) [14:11:34] Lucas_WMDE: the indexes will fail if they have a name- which they should [14:11:37] so that is ok [14:11:44] When I find some tasks that require creating the tables I usually add a note to the deployer to issue the script first [14:11:45] they’re named, yes [14:12:10] so let's send a reminder to deployers to do that after it is merged? [14:12:18] well, actually [14:12:20] not deployers [14:12:29] people that need things to be deployed [14:13:13] there was no problem here- a mistake was detected before full deployments- everything worked as intended :-D [14:13:27] so nothing to see here ! [14:13:30] :-) [14:13:37] heh [14:13:46] that, and we're making steps to make it easier to do this in future ;) [14:13:53] of course [14:14:02] Hey there -- had a question about getting a python module (from pypi, not one I wrote myself) into production. I've created a debian/ dir and can build the package fine, wondering what the right set of next steps are to actually get it into apt and so on. Do I hand off the package built for both jessie and stretch? Or a repo (and if so, what should that repo contain)? [14:14:14] marlier: define production? [14:14:58] And where is the script to be used from etc [14:14:59] Wikimedia APT repository, so that it can be installed by puppet [14:15:21] It'll be used by a few different Perf team utilities [14:15:29] That run in different places [14:17:29] You'll have to ask ops to import them... 
I'm guessing they'll take prebuilt packages from WMF staff, and if everything is in a repo anyway... Might be easier to create a task for tracking of it [14:17:32] marlier: It depends, but normally you want to pass a source package, being compiled on the build server and then an op can upload it there [14:17:46] <_joe_> marlier: you need to have a repository and ask us to build and upload the package [14:18:11] <_joe_> where "us" == people with access to upload on the apt repo, which basically is ops [14:19:35] aside from that, you do not want to install packages, so you also probably need puppet code to install it [14:20:15] which may help giving context to why it is needed, if you understand what I mean [14:22:27] (03CR) 10Dzahn: [C: 032] "> Indeed I missed that as well. Sorry about that." [puppet] - 10https://gerrit.wikimedia.org/r/419341 (owner: 10Dzahn) [14:22:46] (03PS3) 10Rush: openstack: labtestn initial neutron components [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) [14:26:11] jynus: Right, this is an update to coal, which is already in puppet (https://github.com/wikimedia/puppet/blob/production/modules/coal/manifests/init.pp). I have the puppet patch ready to go, but of course it won't work until the module that the new code relies on is actually available... [14:26:58] I'll get it into a repo, thanks all. [14:27:11] marlier: which python module btw? 
[14:27:20] aiokafka [14:27:34] (03PS2) 10Filippo Giunchedi: puppet: depool and reinstall puppetmaster2002 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/419173 (https://phabricator.wikimedia.org/T184562) [14:27:35] sadly not packaged for either jessie or stretch [14:29:43] marlier: I can see in their setup.py that they require kafka-python==1.3.5 [14:29:47] (03CR) 10Rush: "labtestcontrol2003.wikimedia.org,labtestneutron2001.codfw.wmnet,labtestneutron2002.codfw.wmet,labtestvirt2003.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [14:30:49] (03CR) 10Filippo Giunchedi: [C: 032] puppet: depool and reinstall puppetmaster2002 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/419173 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [14:32:59] (03PS5) 10Muehlenhoff: Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 [14:33:06] !log depool puppetmaster2002 for reimage [14:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:35] (03CR) 10jerkins-bot: [V: 04-1] Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 (owner: 10Muehlenhoff) [14:34:20] (03PS4) 10Rush: openstack: labtestn initial neutron components [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) [14:36:36] volans: ugh, you're right. [14:37:00] 1.3.5? 
[14:37:27] https://apt.wikimedia.org/wikimedia/pool/main/p/python-kafka/ [14:37:33] we could package for 1.3.5 v easily if you need [14:38:44] I can also just bump the requirement to 1.4.1 [14:38:56] that would be best [14:39:06] i think there was a strange bug in 1.3.1 that made us update recently [14:39:06] https://github.com/dpkp/kafka-python/pull/828 [14:39:10] or somethign [14:39:22] assuming it works :D [14:40:19] (03PS6) 10Muehlenhoff: Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 [14:41:18] (03PS5) 10MarcoAurelio: Disable abusefilter from collecting private data on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) [14:41:23] (03PS6) 10MarcoAurelio: Disable abusefilter from collecting private data on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) [14:42:43] (03CR) 10jerkins-bot: [V: 04-1] Disable abusefilter from collecting private data on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [14:44:05] (03PS7) 10MarcoAurelio: Disable abusefilter from collecting private data on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) [14:44:48] (03PS3) 10Andrew Bogott: Move horizon and toolsadmin to labweb backends [puppet] - 10https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) [14:44:56] !log beginning migration of eventlogging analtyics from Kafka analytics to Kafka jumbo: T183297 [14:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:02] T183297: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297 [14:45:12] (03CR) 10Ottomata: [C: 032] Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - 
10https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [14:45:21] (03PS4) 10Ottomata: Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - 10https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297) [14:45:25] (03CR) 10Ottomata: [V: 032 C: 032] Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - 10https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [14:45:27] (03CR) 10Andrew Bogott: [C: 032] Move horizon and toolsadmin to labweb backends [puppet] - 10https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [14:46:54] (03CR) 10Huji: [C: 031] Disable abusefilter from collecting private data on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [14:47:04] (03PS7) 10Ottomata: Point eventlogging analytics and webperf processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) [14:47:24] (03CR) 10Andrew Bogott: Move horizon and toolsadmin to labweb backends [puppet] - 10https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [14:47:33] (03PS4) 10Andrew Bogott: Move horizon and toolsadmin to labweb backends [puppet] - 10https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) [14:48:26] (03CR) 10Andrew Bogott: [C: 032] Move horizon and toolsadmin to labweb backends [puppet] - 10https://gerrit.wikimedia.org/r/419226 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [14:50:14] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/10446/" [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [14:54:08] (03PS8) 10Ottomata: Point eventlogging analytics and webperf processes at 
Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) [14:54:10] volans: I promise I'll actually test it! [14:54:11] marlier: FYI we are moving eventlogging to jumbo now, so in your tests you'll want to target those brokers [14:54:14] jumbo-eqiad cluster [14:54:25] Cool, thanks ottomata [14:54:29] :) [14:54:46] (03CR) 10Ottomata: [C: 032] Point eventlogging analytics and webperf processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [14:54:54] (03PS9) 10Ottomata: Point eventlogging analytics and webperf processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) [14:54:56] (03CR) 10Ottomata: [V: 032 C: 032] Point eventlogging analytics and webperf processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [14:55:01] (03CR) 10Filippo Giunchedi: "LGTM, though it looks like tests are failing due to os_version usage. hashar published https://gerrit.wikimedia.org/r/c/419410/ but I know" [puppet] - 10https://gerrit.wikimedia.org/r/419400 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:55:37] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:55:47] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:58:16] !log rebooting furud [14:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:45] (03CR) 10Rush: [C: 032] openstack: labtestn initial neutron components [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [14:58:55] (03PS5) 10Rush: openstack: labtestn initial neutron components [puppet] - 10https://gerrit.wikimedia.org/r/419426 (https://phabricator.wikimedia.org/T188266) [14:59:02] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [14:59:06] !log installing virt-what updates from stretch point release [14:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:20] (03PS4) 10Jcrespo: dbproxy: Reenable firewall on passive m1 & m2 proxies with holes [puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) [15:01:29] (03CR) 10Filippo Giunchedi: [C: 031] wdqs: collect prometheus metrics for both wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/419264 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [15:02:18] (03PS2) 10Gehel: wdqs: collect prometheus metrics for both wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/419264 (https://phabricator.wikimedia.org/T187766) [15:02:57] !log disabling puppet in preparation for reimage of dbproxy1002 and 6 [15:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:15] (03CR) 10Gehel: [C: 032] wdqs: collect prometheus metrics for both wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/419264 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [15:03:26] (03PS5) 10Jcrespo: dbproxy: Reenable firewall on passive m1 & m2 proxies with holes [puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) [15:03:42] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - 
pattern not found - 1976 bytes in 0.099 second response time [15:05:38] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4050038 (10jcrespo) [15:05:41] 10Operations, 10DBA, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#4050039 (10jcrespo) [15:05:44] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050035 (10jcrespo) 05stalled>03Open [15:06:01] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848018 (10jcrespo) a:03jcrespo [15:06:14] (03CR) 10Jcrespo: [C: 032] dbproxy: Reenable firewall on passive m1 & m2 proxies with holes [puppet] - 10https://gerrit.wikimedia.org/r/419392 (https://phabricator.wikimedia.org/T189655) (owner: 10Jcrespo) [15:06:57] (03PS2) 10Andrew Bogott: Revert "multiversion: add a transitional mapping for newwikitech.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417974 [15:08:20] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1002.eqiad.wmnet'] ``` The log can be found in `/var/... [15:09:14] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050055 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1006.eqiad.wmnet'] ``` The log can be found in `/var/... 
[15:11:36] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419446 [15:12:28] !log Disabling BGP on cr2-codfw Zayo transit - T189452 [15:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:34] T189452: Interface errors on cr2-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T189452 [15:13:34] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4050068 (10faidon) a:05faidon>03Papaul I rebooted furud and is not booting right now, saying: ``` The total number of enclosures connected to connector 01, has exceeded the maximum allowable... [15:14:35] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-hhvm-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419447 (https://phabricator.wikimedia.org/T135991) [15:15:04] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for prometheus-hhvm-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419447 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:15:34] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-hhvm-exporter [puppet] - 10https://gerrit.wikimedia.org/r/419447 (https://phabricator.wikimedia.org/T135991) [15:16:09] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4050073 (10Joe) >>! In T188301#4047148, @Papaul wrote: > @Joe ok . > For now I have 5 new servers in A4 and 7 new servers in B3. so moving all the new server in A3 to B... 
[15:16:56] (03PS1) 10Andrew Bogott: horizon: one more cleanup of default 'newhorizon' arg [puppet] - 10https://gerrit.wikimedia.org/r/419449 [15:17:20] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4050082 (10Papaul) @faidon [15:17:44] (03CR) 10Andrew Bogott: [C: 032] Revert "multiversion: add a transitional mapping for newwikitech.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417974 (owner: 10Andrew Bogott) [15:17:58] (03CR) 10Andrew Bogott: [C: 032] horizon: one more cleanup of default 'newhorizon' arg [puppet] - 10https://gerrit.wikimedia.org/r/419449 (owner: 10Andrew Bogott) [15:19:22] papaul: ? [15:19:42] 10Operations, 10DBA, 10Patch-For-Review: Switchover m2 master from db1020 to db1051 - https://phabricator.wikimedia.org/T189656#4049095 (10Marostegui) [15:21:15] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for lldpd [puppet] - 10https://gerrit.wikimedia.org/r/419421 [15:22:01] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: rack/setup/install mw2259-mw2290 - https://phabricator.wikimedia.org/T188301#4050104 (10Papaul) @joe thanks [15:22:24] 10Operations: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4050109 (10Samtar) [15:23:19] 10Operations: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4050122 (10Samtar) [15:24:11] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for lldpd [puppet] - 10https://gerrit.wikimedia.org/r/419421 (owner: 10Muehlenhoff) [15:24:18] paravoid: yes [15:25:09] !log Re-enabling BGP on cr2-codfw Zayo transit - T189452 [15:25:09] 10Operations, 10netops: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4050134 (10Marostegui) [15:25:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:15] T189452: Interface errors on cr2-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T189452 [15:25:55] 10Operations, 10ops-codfw: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4050139 (10fgiunchedi) [15:26:20] 10Operations, 10ops-codfw: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4048384 (10fgiunchedi) a:05fgiunchedi>03Papaul @papaul racking plan looks good (i.e. one machine per row), thanks! [15:27:18] 10Operations, 10ops-codfw, 10netops: Interface errors on cr2-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T189452#4050146 (10ayounsi) 05Open>03Resolved Optic replaced by Papaul, will re-open if the errors shows up again after a few hours. [15:27:59] (03PS1) 10Andrew Bogott: Remove transitional newhorizon and newtoolsadmin domains [dns] - 10https://gerrit.wikimedia.org/r/419451 [15:28:57] (03PS14) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) [15:29:17] hashar zeljkof is there anything going on with CI? I have been waiting for almost 20 minutes now to get a patch checked - just asking in case there is some stuff going on :-) [15:30:16] (03CR) 10Bstorm: wiki replicas: script index creation for easier maintenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [15:31:41] marostegui: nothing I am aware of [15:31:58] <_joe_> marostegui: all normal! 
[15:32:02] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg [15:32:08] zeljkof: I am seeing some other jobs sitting there for 25 minutes :| [15:32:17] _joe_: XDDDDD [15:32:41] marostegui: hashar will know [15:33:37] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1950 bytes in 0.094 second response time [15:34:37] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050165 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['dbproxy1006.eqiad.wmnet'] ``` The log can be found in `/var/... [15:35:45] (03PS1) 10Filippo Giunchedi: Add puppetmaster2002 back, offline [puppet] - 10https://gerrit.wikimedia.org/r/419455 (https://phabricator.wikimedia.org/T184562) [15:36:29] PROBLEM - Apache HTTP on mw2200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:19] RECOVERY - Apache HTTP on mw2200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.103 second response time [15:37:19] PROBLEM - Host dbproxy1006 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:29] (03PS2) 10Vgutierrez: add shell user dynkm/oliver keyes [puppet] - 10https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [15:41:27] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050207 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1002.eqiad.wmnet'] ``` and were **ALL** successful. 
[15:41:33] (03CR) 10Filippo Giunchedi: [C: 032] Add puppetmaster2002 back, offline [puppet] - 10https://gerrit.wikimedia.org/r/419455 (https://phabricator.wikimedia.org/T184562) (owner: 10Filippo Giunchedi) [15:42:30] RECOVERY - Host dbproxy1006 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [15:43:11] marostegui: zeljkof all of the labs integration slaves are currently full [15:43:19] PROBLEM - Host mw2097.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:43:26] lovely [15:43:31] apparently we now run 2 verisons of the mediawiki code coverage job (it takes 1 hour) so there are only 2 slaves running other stuff [15:43:40] and right now those 2 are running browser tests [15:43:40] addshore: there should be a task about that [15:43:49] I'm just gonna kill the coverage jobs for now [15:44:06] (03PS1) 10Jcrespo: dbproxy: Enable firewall on the active m1 and m2 proxies [puppet] - 10https://gerrit.wikimedia.org/r/419456 (https://phabricator.wikimedia.org/T189656) [15:44:24] (03CR) 10Marostegui: [C: 031] dbproxy: Enable firewall on the active m1 and m2 proxies [puppet] - 10https://gerrit.wikimedia.org/r/419456 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [15:44:30] (03Merged) 10jenkins-bot: Revert "multiversion: add a transitional mapping for newwikitech.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417974 (owner: 10Andrew Bogott) [15:44:47] (03CR) 10jenkins-bot: Revert "multiversion: add a transitional mapping for newwikitech.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417974 (owner: 10Andrew Bogott) [15:44:49] PROBLEM - MD RAID on dbproxy1006 is CRITICAL: Return code of 255 is out of bounds [15:44:52] killed, you should start to see stuff move again now [15:44:58] (03PS3) 10Vgutierrez: add shell user dynkm/oliver keyes [puppet] - 10https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [15:45:10] addshore: yeah, just got check - thanks a lot [15:45:14] 
(03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419446 (owner: 10Marostegui) [15:46:19] PROBLEM - configured eth on dbproxy1006 is CRITICAL: Return code of 255 is out of bounds [15:46:34] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419446 (owner: 10Marostegui) [15:46:43] (03CR) 10Andrew Bogott: [C: 032] Remove transitional newhorizon and newtoolsadmin domains [dns] - 10https://gerrit.wikimedia.org/r/419451 (owner: 10Andrew Bogott) [15:46:54] dbproxy1006 apparently took more time than expected on the reimage [15:47:13] the script should downtime it soon [15:47:36] !log andrew@tin Synchronized multiversion/MWMultiVersion.php: wikitech cleanup (duration: 01m 14s) [15:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:59] PROBLEM - Check size of conntrack table on dbproxy1006 is CRITICAL: Return code of 255 is out of bounds [15:48:00] PROBLEM - dhclient process on dbproxy1006 is CRITICAL: Return code of 255 is out of bounds [15:48:27] "Still waiting for reboot after 10.0 minutes" [15:48:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1074 (duration: 01m 15s) [15:48:57] when I see that that always scares me [15:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:23] (03PS15) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) [15:50:27] marostegui: fyi https://phabricator.wikimedia.org/T189693 [15:50:40] (03PS1) 10Marostegui: db-eqiad.php: Restore db1074 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419458 [15:50:43] addshore: ah, thank you! 
[15:51:22] (03PS1) 10Ema: 5.1.3-1wm4: extrachance retry fixes from upstream [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/419460 [15:52:45] Jamesofur: Can Johnben be added to the staff group? [15:52:48] on wiki [15:53:16] (03CR) 10Bstorm: wiki replicas: script index creation for easier maintenance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/417357 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [15:55:17] By which I mean, User:JBennett_(WMF) [15:57:29] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [15:58:02] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [15:58:02] PROBLEM - Check systemd state on kafka1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:58:12] PROBLEM - Check systemd state on kafka1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:58:16] we are working on it --^ [15:58:27] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419446 (owner: 10Marostegui) [15:59:01] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:59:51] that was me ^ and will recover shortly [15:59:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1022 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [16:00:01] RECOVERY - MD RAID on dbproxy1006 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [16:00:01] RECOVERY - Check systemd state on kafka1022 is OK: OK - running: The system is fully operational [16:00:02] RECOVERY - Check size of conntrack table on dbproxy1006 is OK: OK: nf_conntrack is 0 % full [16:00:02] RECOVERY - dhclient process on dbproxy1006 is OK: PROCS OK: 0 processes with command name dhclient [16:00:12] RECOVERY - Check systemd state on kafka1020 is OK: OK - running: The system is fully operational [16:00:21] RECOVERY - configured eth on dbproxy1006 is OK: OK - interfaces up [16:00:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1020 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [16:07:03] (03PS1) 10Alexandros Kosiaris: Enable ServiceAccount admission contoller in kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/419462 [16:07:50] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050305 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1006.eqiad.wmnet'] ``` and were **ALL** successful. 
[16:12:32] !log temporarily add back puppetmaster2002 as a low-weight backend [16:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:52] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4050320 (10RobH) Ok, IRC conversation update: Next Steps: * boot deploy1001 off live CD and check SMART attribu... [16:14:01] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:16:01] PROBLEM - Disk space on kubernetes1004 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/9fe14494-27a2-11e8-8e48-aa0000fe6bdf/volumes/kubernetes.iosecret/tiller-token-xf04b is not accessible: Permission denied [16:16:46] bawolff: We require a couple sign offs for staff group/other user rights ( manager for any user rights, Maggie/c-level for anything that includes checkuser etc like the staff group does) and we do a log of everything so happy to look at it but will need him to email us at ca@ ( https://office.wikimedia.org/wiki/WMF_Staff_userrights_policy ) [16:17:47] Ok, I passed that on [16:18:04] The right of interest for right now is revisiondelete/suppress [16:19:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1074 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419458 (owner: 10Marostegui) [16:21:05] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1074 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419458 (owner: 10Marostegui) [16:21:19] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1074 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419458 (owner: 10Marostegui) [16:22:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1074 original weight (duration: 01m 13s) [16:22:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:28] 10Operations: Integrate stretch 9.4 point update - https://phabricator.wikimedia.org/T189435#4050406 (10MoritzMuehlenhoff) None of the packages removed in the 9.4 update were present in our environment. These are fully rolled out: hdf5 ncurses ntp pdns-recursor python-mimeparse reportbug w3m [16:25:41] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.113 second response time [16:28:01] RECOVERY - Disk space on kubernetes1004 is OK: DISK OK [16:29:25] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/76cd355f-27a4-11e8-8e48-aa0000fe6bdf/volumes/kubernetes.iosecret/tiller-token-xf04b is not accessible: Permission denied [16:29:38] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4050430 (10demon) I can't think of any compelling reason why it would be *required* on the masters...but I could see it as being //useful// if you w... [16:32:10] (03PS1) 10Chad: Adding scap/log to .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419465 [16:32:12] (03CR) 10Chad: [C: 032] Adding scap/log to .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419465 (owner: 10Chad) [16:32:44] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:33:04] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:33:26] (03Merged) 10jenkins-bot: Adding scap/log to .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419465 (owner: 10Chad) [16:33:40] (03CR) 10jenkins-bot: Adding scap/log to .gitignore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419465 (owner: 10Chad) [16:34:34] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:35:16] !log demon@tin Synchronized .gitignore: ignore scap logs (duration: 01m 15s) [16:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:25] RECOVERY - Disk space on kubernetes1003 is OK: DISK OK [16:37:55] 10Operations, 10Goal, 10HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#4050467 (10demon) [16:37:58] 10Operations, 10Deployments, 10Beta-Cluster-reproducible, 10HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. 
HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#4050464 (10demon) 05Open>03Resolved a:03demon [16:38:44] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4050484 (10RobH) p:05Normal>03High [16:38:59] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3588945 (10RobH) a:05RobH>03Cmjohnson [16:39:34] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:40:31] !log installing cron updates from stretch 9.4 point release [16:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:15] PROBLEM - Disk space on kubernetes1002 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/1d0346cc-27a6-11e8-8e48-aa0000fe6bdf/volumes/kubernetes.iosecret/tiller-token-xf04b is not accessible: Permission denied [16:41:44] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - kubelet_operational_latencies is 34278 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:42:44] RECOVERY - puppet last run on wtp2005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:43:04] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:43:32] that was me btw ^ [16:43:44] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - kubelet_operational_latencies is 1463 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:45:04] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [16:48:09] Hi SWAT folks! Any questions about the requested CentralNotice update? 
[16:48:16] (03PS1) 10Ottomata: Blacklist mediawiki.job topics from replication main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/419472 (https://phabricator.wikimedia.org/T189464) [16:48:44] (03CR) 10jerkins-bot: [V: 04-1] Blacklist mediawiki.job topics from replication main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/419472 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [16:48:46] (03PS2) 10Ottomata: Blacklist mediawiki.job topics from replication main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/419472 (https://phabricator.wikimedia.org/T189464) [16:49:15] RECOVERY - Disk space on kubernetes1002 is OK: DISK OK [16:49:49] (03CR) 10Ottomata: [C: 032] Blacklist mediawiki.job topics from replication main -> jumbo [puppet] - 10https://gerrit.wikimedia.org/r/419472 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [16:52:30] (03PS1) 10Ottomata: Whitelist not set, but still getting validation error from puppet, trying false [puppet] - 10https://gerrit.wikimedia.org/r/419475 (https://phabricator.wikimedia.org/T189464) [16:52:52] (03CR) 10Jcrespo: "Not much to review here, just a heads up this is going to production now." [puppet] - 10https://gerrit.wikimedia.org/r/419456 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [16:53:00] (03PS2) 10Jcrespo: dbproxy: Enable firewall on the active m1 and m2 proxies [puppet] - 10https://gerrit.wikimedia.org/r/419456 (https://phabricator.wikimedia.org/T189656) [16:53:04] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:54:05] (03CR) 10Jcrespo: [C: 032] dbproxy: Enable firewall on the active m1 and m2 proxies [puppet] - 10https://gerrit.wikimedia.org/r/419456 (https://phabricator.wikimedia.org/T189656) (owner: 10Jcrespo) [16:54:34] PROBLEM - puppet last run on kafka1023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:54:44] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:55:30] was that mine? [16:55:31] (03CR) 10Ottomata: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/10448/console" [puppet] - 10https://gerrit.wikimedia.org/r/419475 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [16:55:33] (03CR) 10Ottomata: [C: 032] Whitelist not set, but still getting validation error from puppet, trying false [puppet] - 10https://gerrit.wikimedia.org/r/419475 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [16:55:39] (03PS2) 10Ottomata: Whitelist not set, but still getting validation error from puppet, trying false [puppet] - 10https://gerrit.wikimedia.org/r/419475 (https://phabricator.wikimedia.org/T189464) [16:55:39] seems it is not [16:55:41] (03CR) 10Ottomata: [V: 032 C: 032] Whitelist not set, but still getting validation error from puppet, trying false [puppet] - 10https://gerrit.wikimedia.org/r/419475 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [16:56:08] !log deploying new firewall rules to dbproxy1001 and 7 [16:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:25] PROBLEM - Disk space on kubernetes1003 is CRITICAL: DISK CRITICAL - /var/lib/kubelet/pods/39d00d3e-27a8-11e8-8e48-aa0000fe6bdf/volumes/kubernetes.iosecret/tiller-token-cw3sp is not accessible: Permission denied [16:58:04] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:58:24] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [16:58:44] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [16:58:48] !log Manually running extensions/Wikibase/repo/maintenance/dispatchChanges.php on terbium, so that dispatching can catch up [16:58:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:33] mutante: check if you see something weird [16:59:34] RECOVERY - puppet last run on kafka1023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:59:44] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180314T1700). [17:00:04] subbu, MaxSem, ejegg, AndyRussG, Urbanecm, and MatmaRex: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:17] nothing for parsoid today [17:00:27] is gerrit up for everyone? [17:00:35] jynus: i don't, gerrit is running [17:00:43] and service says it's fine [17:00:47] Here! [17:00:55] mutante: connections didn't get closed [17:01:06] so probably ok for that [17:01:26] i can also logout and login again [17:01:26] we could be afecting some other more obscure service I didn't catched connected [17:01:33] that should be a new connection right [17:01:52] ok, keeping an eye out [17:02:04] mutante: if for some reason you see a mysql conection that failed on some service [17:02:29] https://gerrit.wikimedia.org/r/#/c/419392/5/modules/profile/manifests/mariadb/ferm_misc.pp [17:02:35] ^this is where to add extra rules [17:03:00] you can see there already ones for rt, gerrit and librenms [17:03:05] jynus: ok! gotcha, yes [17:03:27] others should be available by default on the general 10.x rule [17:03:37] checks netmon1003 /servermon [17:03:40] thank you for the help [17:03:46] ok, np [17:03:49] oh 1003? 
[17:04:00] I though it was only 1002 [17:04:07] yea, servermon has it's own server [17:04:12] because it cant run on stretch [17:04:18] while the other services are already on stretch [17:04:21] but I only saw conenctions from 1002 [17:04:28] yea, so https://servermon.wikimedia.org/ [17:04:28] should I add 1003? [17:04:32] Hi all, lmk if you have any questions about the CentralNotice deploy [17:04:34] yes, we need to add it [17:04:38] 10Operations, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050605 (10RobH) [17:04:42] it has a proxy error [17:04:48] ok, adding it [17:05:00] I guess it is not an outage-level problem? [17:05:17] no, it's just SRE-internal tool to track our servers/patches [17:05:29] ok, adding it, of course [17:05:38] just understanding how bad it was [17:06:00] 10Operations, 10Wikimedia-Apache-configuration, 10User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4050615 (10fgiunchedi) See also https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxystatus [17:06:26] ejegg: What...about CentralNotice? [17:06:40] uh, I can deploy but I have SoS in 30 mins [17:06:47] no_justification: scheduled 4 SWAT deploy [17:06:55] 10Operations, 10DBA: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050617 (10jcrespo) a:05Marostegui>03jcrespo wait, robh, I will take this for now- not yet ready for decom. [17:07:50] 10Operations, 10DBA: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050629 (10RobH) I wasn't taking it, was merely tagging in all decom requests with #hw-requests. 
I left it assigned to @Marostegui ;] [17:08:02] mostly no-op, only two small actual things + gerrited versions of security patches that are already on production [17:08:05] AndyRussG: Oh, that's nbd...in fact I was gonna do that during the train today later anyway but you'll beat me to it [17:08:08] no_justification: oh, it's in the request list for the current SWAT window [17:08:24] 10Operations, 10DBA, 10hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050631 (10jcrespo) ok. [17:08:26] derp, sorry, just read the hightlighted line [17:08:35] jynus: i can make the patch [17:08:44] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4050633 (10faidon) Figured this out with @Papaul on IRC (thanks!). We have now: - connector0 connected to arrays 3, 4, 5 - connector1 connected to arrays 6, 7 megacli shows 6 arrays with 12 dis... [17:08:46] subbu: did I get it right that you don't need that patch deployed? [17:09:00] oh sorry .. which one? [17:09:05] mutante: I am on it, just got distracted [17:09:08] * subbu spaced out [17:09:12] ok [17:09:13] https://gerrit.wikimedia.org/r/#/c/416489/ [17:09:22] 10:00 nothing for parsoid today [17:09:25] oops .. i mistakenly thought this was the parsoid window :) [17:09:33] no, this is the swat window, isn't it. [17:09:43] yes, i want that deployed .. MaxSem thanks for checking :) [17:09:56] mutante: just one question, I thought librenms ran on 1002, what is on 1002 and what is on 1003? [17:09:58] no_justification: ah cool! Yeah remember CN is still a snowflake.. With the updates to the wmf_deploy branch (linked on Deploy page) you should get updated submodule pointers in core repo [17:10:04] just for naming the rules [17:10:11] (MaxSem: ^, if you're deplloying) [17:10:15] thanks all!!!!! 
[17:10:19] (03PS1) 10Chad: Group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419481 [17:10:21] (03CR) 10Chad: [C: 04-2] Group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419481 (owner: 10Chad) [17:10:26] AndyRussG: Oh trust me, I know all too well. [17:10:26] (03PS1) 10Ottomata: Burrow should monitor eventlogging groups from jumbo [puppet] - 10https://gerrit.wikimedia.org/r/419482 (https://phabricator.wikimedia.org/T183297) [17:10:34] I'm reminded every week with the branching script ;-) [17:10:46] subbu: The change could not be rebased due to a conflict during merge. [17:10:59] hmm .. ok. let me fix that. [17:10:59] awww apologies we need to fix that, indeed [17:11:11] (03PS2) 10MaxSem: Disable ArticleCreationWorkflow, ACTRIAL ends on the 14th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419077 (https://phabricator.wikimedia.org/T186570) [17:11:16] (03CR) 10MaxSem: [C: 032] Disable ArticleCreationWorkflow, ACTRIAL ends on the 14th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419077 (https://phabricator.wikimedia.org/T186570) (owner: 10MaxSem) [17:12:24] jynus: netmon1003 = servermon on jessie netmon1002 = everything else on stretch [17:12:41] reason: servermon not working on stretch yet [17:12:57] AndyRussG: The technical part of the fix is easy. 
It's about communicating workflow changes ;-) [17:13:05] so it's "jessie VM for servermon until it supports stretch" [17:13:08] ok, that makes sense [17:13:20] (03PS1) 10Jcrespo: dbproxy: Add netmon1003 to the list of misc allowed services [puppet] - 10https://gerrit.wikimedia.org/r/419483 [17:13:33] mutante: see what this looks like https://gerrit.wikimedia.org/r/#/c/419483/ [17:13:37] (03PS2) 10Subramanya Sastry: Enable RemexHTML on kowiki, mznwiki, warwiki, cebwiki, nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416489 (https://phabricator.wikimedia.org/T188869) [17:13:52] (03Merged) 10jenkins-bot: Disable ArticleCreationWorkflow, ACTRIAL ends on the 14th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419077 (https://phabricator.wikimedia.org/T186570) (owner: 10MaxSem) [17:13:57] 10Operations, 10DBA, 10hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050652 (10Marostegui) Yeah I never use the other tags till we have it ready from the DBA side, to avoid all the noise for the DC Ops :) [17:14:05] no_justification: hmmm that sounds good... I guess we just have to talk out requirements for a proper deploy workflow on the team (not a big deal, we just need to get to it.....) [17:14:09] (03CR) 10Dzahn: "call the service "servermon". it's not librenms here" [puppet] - 10https://gerrit.wikimedia.org/r/419483 (owner: 10Jcrespo) [17:14:19] jynus: looks good except ^ [17:14:38] 10Operations, 10DBA, 10hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4050666 (10jcrespo) Oh, if they want the noise, they will get it here :-P [17:15:05] jynus: i would say "netmon-tools" (1002) and "servermon" (1003) [17:15:13] MaxSem, fixed. [17:15:14] 1002 has more than just librenms [17:15:15] is the SWAT happening? 
[17:15:36] (03PS2) 10Jcrespo: dbproxy: Add netmon1003 to the list of misc allowed services [puppet] - 10https://gerrit.wikimedia.org/r/419483 [17:16:08] MatmaRex: I'm deploying for the next 15 min [17:16:18] (03PS3) 10Jcrespo: dbproxy: Add netmon1003 to the list of misc allowed services [puppet] - 10https://gerrit.wikimedia.org/r/419483 [17:16:21] (03PS2) 10Alexandros Kosiaris: Enable ServiceAccount admission contoller in kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/419462 [17:16:42] (03CR) 10Dzahn: [C: 031] dbproxy: Add netmon1003 to the list of misc allowed services [puppet] - 10https://gerrit.wikimedia.org/r/419483 (owner: 10Jcrespo) [17:17:12] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1043 - https://phabricator.wikimedia.org/T187542#4050670 (10RobH) [17:17:31] jynus: yea, that's right. netmon = server name for a server with (multiple) network monitoring tools servermon = name for a specific software just running on netmon1003 alone [17:17:37] lgtm now [17:18:26] 10Operations, 10Wikimedia-Apache-configuration, 10User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4050672 (10fgiunchedi) Including something like this might help as well `LogLevel warn proxy:info proxy_http:info proxy_balancer:info` [17:18:29] well, note that this rules are not based on the app service [17:18:30] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/419077 (duration: 01m 15s) [17:18:34] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10449/kafkamon1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/419482 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [17:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:38] unlike host names [17:18:57] these wer for specific applications, and I supposed only librenms database was used [17:19:01] (03CR) 10jenkins-bot: Disable 
ArticleCreationWorkflow, ACTRIAL ends on the 14th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419077 (https://phabricator.wikimedia.org/T186570) (owner: 10MaxSem) [17:19:15] (03CR) 10Jcrespo: [C: 032] dbproxy: Add netmon1003 to the list of misc allowed services [puppet] - 10https://gerrit.wikimedia.org/r/419483 (owner: 10Jcrespo) [17:19:26] (03PS4) 10Jcrespo: dbproxy: Add netmon1003 to the list of misc allowed services [puppet] - 10https://gerrit.wikimedia.org/r/419483 [17:19:34] hmm, right, i would have to double check if any other of the tools use DB [17:19:44] but servermon the software is not on 1002, so that part is right [17:19:59] (03PS3) 10MaxSem: Enable RemexHTML on kowiki, mznwiki, warwiki, cebwiki, nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416489 (https://phabricator.wikimedia.org/T188869) (owner: 10Subramanya Sastry) [17:20:05] names here are irrelevant, so no much pain on changing it [17:20:05] (03CR) 10MaxSem: [C: 032] Enable RemexHTML on kowiki, mznwiki, warwiki, cebwiki, nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416489 (https://phabricator.wikimedia.org/T188869) (owner: 10Subramanya Sastry) [17:20:17] ack :) [17:22:17] (03Merged) 10jenkins-bot: Enable RemexHTML on kowiki, mznwiki, warwiki, cebwiki, nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416489 (https://phabricator.wikimedia.org/T188869) (owner: 10Subramanya Sastry) [17:22:37] (03PS1) 10BryanDavis: toolforge: Update role and hiera for elasticsearch 5.5 [puppet] - 10https://gerrit.wikimedia.org/r/419485 (https://phabricator.wikimedia.org/T181531) [17:22:52] jynus: it fixed it. 
https://servermon.wikimedia.org/ is working [17:22:57] cool [17:23:07] (03PS1) 10Madhuvishy: dumps: Enable ipv6 connectivity for C3SL mirror host [puppet] - 10https://gerrit.wikimedia.org/r/419486 [17:23:12] if you find someone complaining, you know know the drill [17:23:14] (03PS1) 10RobH: decom db1043 [dns] - 10https://gerrit.wikimedia.org/r/419487 (https://phabricator.wikimedia.org/T187542) [17:23:23] we are really bad at keeping track of those dependencies [17:23:29] subbu: pulled on mwdebug1002 [17:23:52] ok. testing. [17:24:03] jynus: yep, if i find anything i'll add it [17:24:20] one last think Long running screen/tmux is bad for a just reinstalled host [17:24:28] do you know what could be it [17:24:33] some missing dependency? [17:24:34] (03CR) 10jenkins-bot: Enable RemexHTML on kowiki, mznwiki, warwiki, cebwiki, nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416489 (https://phabricator.wikimedia.org/T188869) (owner: 10Subramanya Sastry) [17:24:43] "Return code of 255 is out of bounds" [17:24:55] MaxSem, looks good on nlwiki and kowiki .. as long as there are no log errors you see .. good to go. [17:25:37] (03PS3) 10Alexandros Kosiaris: Enable ServiceAccount admission contoller in kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/419462 [17:25:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Enable ServiceAccount admission contoller in kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/419462 (owner: 10Alexandros Kosiaris) [17:26:22] jynus: there is a ticket for it and i'm pretty sure it's because the interval between checks is set to something really high, like hours instead of minutes [17:26:22] (03PS1) 10RobH: decom db1043 [puppet] - 10https://gerrit.wikimedia.org/r/419488 (https://phabricator.wikimedia.org/T187542) [17:26:38] so what happens is it tries.. it fails.. then it works again [17:26:39] ok, so if I schedule it now, it should go away? 
[17:26:42] but it won't notice for hours [17:26:45] yes [17:26:46] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/416489 (duration: 01m 14s) [17:26:49] thanks [17:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:51] yw [17:26:53] subbu: ^ [17:26:58] \o/ thanks. [17:27:00] I was worrying it could be some missinstallation [17:27:12] no, it's "normal" [17:27:15] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1043 - https://phabricator.wikimedia.org/T187542#4050687 (10RobH) a:05RobH>03Cmjohnson Please note that the switch port for this host was not labeled & doesn't show in ethernet switching table. So @Cmjohnson we... [17:27:30] and it should go away [17:27:36] (03PS2) 10RobH: decom db1043 [dns] - 10https://gerrit.wikimedia.org/r/419487 (https://phabricator.wikimedia.org/T187542) [17:27:43] I just forced the check to make it go away [17:27:48] the bonus thing is that if you have error code 255 (and not 0 or 1) then it's considered UNKNOWN or something [17:27:52] OK, that's it - I've a meeting. would appreciate if someone else continued with the SWAT [17:27:59] (03CR) 10RobH: [C: 032] decom db1043 [dns] - 10https://gerrit.wikimedia.org/r/419487 (https://phabricator.wikimedia.org/T187542) (owner: 10RobH) [17:27:59] and that means unlike an normal CRIT.. it won't be silenced by downtime [17:28:02] or so [17:28:04] (03CR) 10RobH: [C: 032] decom db1043 [puppet] - 10https://gerrit.wikimedia.org/r/419488 (https://phabricator.wikimedia.org/T187542) (owner: 10RobH) [17:28:16] (03PS2) 10RobH: decom db1043 [puppet] - 10https://gerrit.wikimedia.org/r/419488 (https://phabricator.wikimedia.org/T187542) [17:28:17] mutante: It is actually identified as critical, not UNK [17:28:30] so that could be problem with icinga [17:28:48] hrm [17:29:29] then i need to be more specific even.. 
return code 255 from NRPE check [17:29:34] manifesting as CRIT [17:30:52] Hmmm should I ping people listed for the SWAT deploy? [17:31:11] ? [17:31:30] all the NRPE checks would fail during install.. simply because it can't connect to nrpe server before it's running [17:31:42] the difference is this one has the long check interval [17:32:09] so it takes a lot longer to notice than the other checks [17:32:32] Reedy SWAT deploy has become deployerless... We could also just book a separate slot I guess [17:32:42] Ah, I see [17:32:45] I see [17:32:48] I've gotta go into a meeting in a few minutes... [17:33:03] Yeah I think this is the slot where that happens ;p [17:33:18] No worries :) [17:33:50] AndyRussG: TBH, you don't necessarily need to deploy yours... [17:34:02] You can just make sure releng are aware they don't need to take the patch forward [17:34:53] Reedy: you mean about the security stuff? Right... However there is also a minor new thingy that we were hoping to get out [17:35:08] Also not terribly urgent tho [17:35:10] AndyRussG: I only see one patch? [17:35:14] https://gerrit.wikimedia.org/r/#/c/419313/ [17:35:19] (03CR) 10ArielGlenn: [C: 031] "Thanks for this." [puppet] - 10https://gerrit.wikimedia.org/r/419486 (owner: 10Madhuvishy) [17:36:14] (03PS2) 10ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) [17:36:16] Reedy: yes... I mean, that's a merge of CN master into the deploy branch. 
Includes security stuff that's already on prod, a few no-ops, and the new bit of code we wanted to put out [17:36:23] Aha [17:36:37] I didn't look at the actual patch [17:36:52] Ah an done other very minor improvement [17:37:40] Reedy: yeah CN is allll snowflake still these days [17:40:56] (03PS4) 10Herron: naggen2: add support for puppetdb v4 settings and api [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) [17:41:47] (03CR) 10Herron: naggen2: add support for puppetdb v4 settings and api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) (owner: 10Herron) [17:43:23] AndyRussG: staged... [17:43:31] Reedy: ah cool thx! [17:43:35] Do you want them on mwdebug? Or just pushing live? [17:43:37] (03CR) 10Volans: [C: 031] "LGTM, I guess we can drop the except part after the migration is completed." [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) (owner: 10Herron) [17:43:57] Reedy: safer I guess to go the mwdebug route if it's not burdensome, thanks much!!! [17:44:45] (03CR) 10Filippo Giunchedi: [C: 031] naggen2: add support for puppetdb v4 settings and api [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) (owner: 10Herron) [17:46:51] AndyRussG: should be on mwdebug1001... [17:47:15] Reedy: testing.... [17:48:05] though, I'm guessing you're really gonna need a full scap [17:50:33] Reedy: looks good! No full scap needed really [17:50:41] Oh wait [17:50:41] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1955 bytes in 0.090 second response time [17:50:47] Sorry maybe yeah, there are new i18n messages [17:50:49] you're changing en.json I think I saw :P [17:50:54] yeah [17:51:14] Not sure I'm gonna have time to scap [17:51:19] But I can sync the CN tree at least [17:51:38] Reedy: ok cool! 
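
Editor's note on the NRPE exchange above (17:24 to 17:32): the check reported "Return code of 255 is out of bounds" because Nagios-style plugins are expected to exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is only an illustration of that mapping, not Icinga's code. The function name and the `out_of_bounds_state` parameter are hypothetical; the default of CRITICAL is an assumption taken from the observation in the log that the check showed up as CRIT rather than UNKNOWN.

```python
# Standard Nagios plugin exit codes and the state each one maps to.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def classify_exit_code(code, out_of_bounds_state="CRITICAL"):
    """Map a plugin exit code to (state, note).

    Codes outside 0..3 are "out of bounds". The state used for them is an
    assumption here: the log reports that 255 manifested as CRIT.
    """
    if code in NAGIOS_STATES:
        return NAGIOS_STATES[code], ""
    return out_of_bounds_state, f"Return code of {code} is out of bounds"


if __name__ == "__main__":
    for code in (0, 2, 3, 255):
        print(code, classify_exit_code(code))
```

As the log explains, the stale result lingers only because this particular check has a long interval; forcing a re-check once the NRPE server is running clears it.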
[17:51:58] It'd all get scapp'd with the train deploy in 1 hour anyway, right? [17:52:14] (03PS1) 10MaxSem: Undeploy disable ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419492 (https://phabricator.wikimedia.org/T186570) [17:52:37] I mean, 1 hour of ugly non-messages just on a small corner of the CN admin interface is fine, I think [17:53:30] i guess SWAT is dead? [17:53:34] Depends when someone is gonna be running scap [17:53:54] Reedy: doesn't it normally run with the train? no_justification ^ ? [17:54:03] Yeah, but the train won't be running scap today [17:54:07] last two items were not deployed. but i guess it might be too late now [17:54:09] No new branch deployed (it's already partly out) [17:54:09] Ah hmmm ok [17:54:21] AndyRussG: No [17:54:22] I can probably run scap in a bit, but I've gotta leave where I am in a few [17:54:32] !log reedy@tin Synchronized php-1.31.0-wmf.24/extensions/CentralNotice: updates! (duration: 01m 18s) [17:54:35] I can run it now [17:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:40] no_justification: thanks! [17:54:44] Reedy: also thanks! [17:54:57] !log demon@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "reedy"; reason is "updates!" (duration: 00m 00s) [17:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:03] lol [17:55:11] Wait, you finished right? [17:55:14] One branch [17:55:16] Just doing the other [17:55:18] Oh [17:55:25] it's sync-master-ing [17:55:43] canaries [17:55:57] !log reedy@tin Synchronized php-1.31.0-wmf.25/extensions/CentralNotice: updates! 
(duration: 01m 16s) [17:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:07] !log demon@tin Started scap: rebuilding l10n [17:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:38] chirp chirp [17:57:17] (03CR) 10BryanDavis: toolforge: Update role and hiera for elasticsearch 5.5 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/419485 (https://phabricator.wikimedia.org/T181531) (owner: 10BryanDavis) [17:57:20] (canary sounds) [17:58:32] (03PS2) 10Andrew Bogott: toolforge: Update role and hiera for elasticsearch 5.5 [puppet] - 10https://gerrit.wikimedia.org/r/419485 (https://phabricator.wikimedia.org/T181531) (owner: 10BryanDavis) [17:58:56] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050907 (10jcrespo) All proxies are now on stretch except the ones for labsdbs (10 and 11). [17:59:08] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#4050908 (10jcrespo) a:05jcrespo>03None [17:59:55] (03CR) 10Andrew Bogott: [C: 032] toolforge: Update role and hiera for elasticsearch 5.5 [puppet] - 10https://gerrit.wikimedia.org/r/419485 (https://phabricator.wikimedia.org/T181531) (owner: 10BryanDavis) [18:00:08] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180314T1800) [18:00:08] No GERRIT patches in the queue for this window AFAICS. 
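
Editor's note on the LockFailedError logged at 17:54:57: the message shows that a second deployer was refused because the lock file already recorded an owner ("reedy") and a reason ("updates!"). The following is a minimal sketch of that general pattern, assuming an advisory `flock` on the lock file plus a small JSON record. It is not scap's actual implementation; `deploy_lock` and the JSON layout are hypothetical, and only the error message wording is taken from the log.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


class LockFailedError(Exception):
    """Raised when someone else already holds the lock file."""


def _read_holder(fd):
    # Read whatever record the current holder wrote; tolerate an empty file.
    os.lseek(fd, 0, os.SEEK_SET)
    raw = os.read(fd, 4096)
    try:
        return json.loads(raw.decode())
    except ValueError:
        return {}


@contextmanager
def deploy_lock(path, owner, reason):
    # Open without truncating so a refused caller can still read the record.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fail immediately if already held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            holder = _read_holder(fd)
            raise LockFailedError(
                'Failed to acquire lock "%s"; owner is "%s"; reason is "%s"'
                % (path, holder.get("owner", "unknown"), holder.get("reason", "unknown"))
            ) from None
        # We hold the lock: record who we are and why.
        os.ftruncate(fd, 0)
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, json.dumps({"owner": owner, "reason": reason}).encode())
        try:
            yield
        finally:
            os.ftruncate(fd, 0)
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "example.lock")
    with deploy_lock(lock_path, "reedy", "updates!"):
        try:
            with deploy_lock(lock_path, "demon", "rebuilding l10n"):
                pass
        except LockFailedError as err:
            print(err)
```

Storing the reason next to the owner is what lets the refused deployer see at a glance that the other sync is still in progress, as happened here.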
[18:00:18] (03PS1) 10Ottomata: Migrate eventbus camus job to Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/419493 (https://phabricator.wikimedia.org/T189713) [18:03:46] (03PS2) 10MaxSem: Undeploy the disabled ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419492 (https://phabricator.wikimedia.org/T186570) [18:07:30] (03PS1) 10Andrew Bogott: wikitech-static: monitor using check_https_url_at_address_for_string [puppet] - 10https://gerrit.wikimedia.org/r/419494 (https://phabricator.wikimedia.org/T189584) [18:08:11] (03CR) 10Andrew Bogott: [C: 032] wikitech-static: monitor using check_https_url_at_address_for_string [puppet] - 10https://gerrit.wikimedia.org/r/419494 (https://phabricator.wikimedia.org/T189584) (owner: 10Andrew Bogott) [18:09:09] (03PS2) 10Mforns: Remove sensitive fields from whitelist for QuickSurvey schemas [puppet] - 10https://gerrit.wikimedia.org/r/405727 (https://phabricator.wikimedia.org/T174386) (owner: 10Fdans) [18:10:23] no_justification: is running a full scap just as simple as going to /srv/mediawiki-staging and running 'scap sync'? I do have rights there, so I could potentially do that... Since it's been a while since I actually did any deploys myself on prod, I confess I'm a bit terrified of it, but if it's just that one command that's left, I can do it... (Especially if folks are around to help if something [18:10:25] goes awry!!!) 
https://wikitech.wikimedia.org/wiki/How_to_deploy_code#More_complex_changes:_sync_everything [18:12:07] (03PS2) 10Madhuvishy: dumps: Enable ipv6 connectivity for C3SL mirror host [puppet] - 10https://gerrit.wikimedia.org/r/419486 [18:13:02] (03CR) 10Madhuvishy: [C: 032] dumps: Enable ipv6 connectivity for C3SL mirror host [puppet] - 10https://gerrit.wikimedia.org/r/419486 (owner: 10Madhuvishy) [18:14:47] (03PS3) 10Mforns: Remove sensitive fields from whitelist for QuickSurvey schemas [puppet] - 10https://gerrit.wikimedia.org/r/405727 (https://phabricator.wikimedia.org/T174386) (owner: 10Fdans) [18:16:26] !log running pt-table-checksum on all m1, some lag will happen on passive replicas [18:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:31] (03PS4) 10Mforns: Remove sensitive fields from whitelist for QuickSurvey schemas [puppet] - 10https://gerrit.wikimedia.org/r/405727 (https://phabricator.wikimedia.org/T174386) (owner: 10Fdans) [18:20:18] !log running pt-table-checksum on all m2, some lag will happen on passive replicas [18:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:28] AndyRussG: Yes, that's the process. I'm already doing it though [18:23:34] (l10n rebuild is going slowly :\) [18:24:29] no_justification: ah gotcha... thanks!!!!! [18:25:04] (03CR) 10Mforns: [C: 031] Remove sensitive fields from whitelist for QuickSurvey schemas [puppet] - 10https://gerrit.wikimedia.org/r/405727 (https://phabricator.wikimedia.org/T174386) (owner: 10Fdans) [18:25:46] no_justification: ah right, I see the log message now.... 
[18:26:03] I've got this theory that the first scap of the day is the slowest, but never got data to confirm my suspicions [18:27:36] hmm [18:28:19] (03PS5) 10Ottomata: Remove sensitive fields from whitelist for QuickSurvey schemas [puppet] - 10https://gerrit.wikimedia.org/r/405727 (https://phabricator.wikimedia.org/T174386) (owner: 10Fdans) [18:28:23] (03CR) 10Ottomata: [V: 032 C: 032] Remove sensitive fields from whitelist for QuickSurvey schemas [puppet] - 10https://gerrit.wikimedia.org/r/405727 (https://phabricator.wikimedia.org/T174386) (owner: 10Fdans) [18:29:10] no_justification: hmmmm, maybe some due to some caching somewhere for rysnc in hashes or checksums to see which files have changed? [18:30:10] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4051066 (10Varnent) >>! In T188776#4046204, @jhsoby wrote: >>>! In T188776#4021634, @Varnent wrote: >>>>! In T... 
[18:30:23] I think it's because l10nupdate comes along and updates all the localization stuff overnight [18:30:35] Then first scap comes along and helpfully sees itself as outta date, so rebuilds everything [18:30:50] hmmmmm [18:31:12] Considering the end result is identical, it's mostly an exercise in wasting time [18:31:17] (03PS1) 10Ottomata: Remove eventlogging-analytics camus job [puppet] - 10https://gerrit.wikimedia.org/r/419498 (https://phabricator.wikimedia.org/T183297) [18:31:18] :) [18:31:19] i18nupdate and scap, two trains passing in the night [18:31:54] (03PS2) 10Ottomata: Remove eventlogging-analytics camus job [puppet] - 10https://gerrit.wikimedia.org/r/419498 (https://phabricator.wikimedia.org/T183297) [18:31:56] (03CR) 10Ottomata: [V: 032 C: 032] Remove eventlogging-analytics camus job [puppet] - 10https://gerrit.wikimedia.org/r/419498 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [18:34:44] (03PS1) 10Awight: Update ORES threshold config to the new syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419499 (https://phabricator.wikimedia.org/T181159) [18:51:08] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4051147 (10Cmjohnson) The errors are disk related, same as bast1002. In bast1002 we replaced dev/sda and the I/... [18:51:36] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T189403#4051149 (10Cmjohnson) Disk replaced and rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Sp... 
[18:53:29] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4051156 (10Cmjohnson) No errors again today [18:55:51] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T189403#4051161 (10Marostegui) Thanks!!! [18:56:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4051162 (10RobH) a:05RobH>03Cmjohnson Chris, Can you install the two NICs that came in on T188297 into labvirt102[12]? They will replace the current Intel daughter cards with the... [18:58:06] 10Operations, 10ops-eqiad: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4051168 (10RobH) [18:59:13] 10Operations, 10DBA, 10hardware-requests: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4051170 (10RobH) >>! In T189216#4050652, @Marostegui wrote: > Yeah I never use the other tags till we have it ready from the DBA side, to avoid all the noise for the DC Ops :) no worries, I wasnt sure... [19:00:04] no_justification: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180314T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:04:36] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#4051190 (10Cmjohnson) a:03ayounsi @ayounsi can you help setup network ports please and assign to Jeff Green once finished. Thanks! [19:05:02] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#4051193 (10Cmjohnson) a:03ayounsi @ayounsi can you help setup network ports please and assign to Jeff Green once finished. Thanks! 
[19:05:31] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#4051197 (10Cmjohnson) a:03ayounsi @ayounsi can you help setup network ports please and assign to Jeff Green once finished. Thanks! [19:06:03] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#4051201 (10Cmjohnson) a:03ayounsi @ayounsi can you help setup network ports please and assign to Jeff Green once finished. Thanks! [19:06:23] !log demon@tin scap failed: LockFailedError Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "demon"; reason is "rebuilding l10n" (duration: 00m 00s) [19:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:31] !log demon@tin Started scap: scapping, pt. 2. prior one failed because i tested something [19:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:02] 10Operations, 10ops-eqiad: OfflineUncorrectableSector on mw1256 sda - https://phabricator.wikimedia.org/T186535#4051214 (10Cmjohnson) 05Open>03Resolved This has been completed. 
resolving [19:07:36] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1953 bytes in 0.089 second response time [19:08:06] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4051218 (10Papaul) a:05Papaul>03MoritzMuehlenhoff @MoritzMuehlenhoff - main board replacement - ILO configuration - new nic 1 MAC address e0:07:1b:f7:63:68 [19:08:08] Hmmm scap didn't or something [19:08:27] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4051220 (10Cmjohnson) Removing ops-eqiad project tag [19:09:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4051224 (10Cmjohnson) @gehel do you want to decommission this server then? [19:12:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4051241 (10RobH) [19:13:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4051243 (10RobH) p:05Triage>03Normal [19:15:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Discovery-Search (Current work): decommission elastic1021 - https://phabricator.wikimedia.org/T189727#4051266 (10RobH) So this system was already offline for memory testing, but I want to confirm with @gehel we're good to start decommission, whi... [19:17:17] PROBLEM - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:17:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4051274 (10RobH) 05Open>03Resolved a:03RobH Discussion on T189223 resulted in the decision to decommission elastic1021. I've created T189727 listing... [19:25:32] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4051301 (10RobH) disabling coppers port for decom on T176957. robh@asw-b-eqiad# show | compare [edit interfaces ge-4/0/5] + disable; [19:25:52] 10Operations, 10ops-eqiad, 10Packaging, 10hardware-requests: Decommission host copper.eqiad.wmnet - https://phabricator.wikimedia.org/T176957#3642514 (10RobH) a:05RobH>03Cmjohnson [19:26:51] 10Operations, 10ops-eqiad, 10Packaging, 10hardware-requests: Decommission host copper.eqiad.wmnet - https://phabricator.wikimedia.org/T176957#3642514 (10RobH) [19:27:37] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.116 second response time [19:34:36] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.108 second response time [19:37:11] dr0ptp4kt: heyas, i saw your email to ops list [19:37:26] i didnt wanna reply back jsut to point out running any non enterprise idea (like MBP) is not scalable. [19:37:40] i assume someone else will reply to more points than just that on an email reply [19:38:08] but a bunch of laptops running a rendering service is something that would have a high degree of administrative overhead (power, cost, security, software updates, etc) [19:38:12] "Request from 88.97.96.89 via cp1099 cp1099, Varnish XID 502155216 [19:38:12] Error: 429, Too Many Requests at Wed, 14 Mar 2018 19:37:16 GMT" [19:38:21] What's 429? 
[19:38:25] 10Operations: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729#4051340 (10Imarlier) [19:38:36] http://lmgtfy.com/?q=http+429 [19:39:17] I don't understand why I got it [19:39:23] As I'm not running automated tools [19:39:38] https://upload.wikimedia.org/wikipedia/commons/thumb/4/48/The_New_Testament_of_Iesvs_Christ_faithfvlly_translated_into_English%2C_ovt_of_the_authentical_Latin%2C_diligently_conferred_with_the_Greek%2C_%26_other_Editions_in_diuers_languages.pdf/page179-6038px-thumbnail.pdf.jpg [19:39:41] It's noy you specifically [19:40:14] 10Operations, 10Performance-Team, 10Traffic, 10Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#4051353 (10Krinkle) [19:40:17] BTW There was a concern the Tumbnailer might not be liking the relevant PDF on that page [19:42:32] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T189403#4051365 (10Marostegui) 05Open>03Resolved All good! Thanks a lot! ``` root@db1073:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name... [19:43:49] Reedy: Weird, I'm also getting HTTP 429 on that, even with a random query. [19:44:02] Happening to hit a busy appserver? [19:44:05] no_justification: it looks like the scap failed and restarted? sorry if this is getting in the way of train departure [19:44:06] Looks like whatever issued it in the backend, might be using the wrong IP to decide. [19:44:21] Maybe Thumbor is considering Varnish's IP or something like that. [19:44:24] AndyRussG: Train today is easy [19:44:28] Just wikiversions swap [19:44:44] no_justification: ah cool [19:45:43] Although, it should *not* be taking 35 minutes [19:45:46] This is off... [19:45:55] Hmmm... 
the CN code itself is working great, we've put it through its paces a few times [19:46:03] It's not CN's fault [19:46:08] There's some trickery afoot [19:46:11] hmmm [19:46:29] * ShakespeareFan00 doesn't understand half of the chat :( [19:47:15] ShakespeareFan00: train is the weekly Mediawiki train deploy, CN is CentralNotice, an extension that was updated a little while ago, scap is a command to update the servers [19:48:02] thcipriani: I feel like aborting it is just gonna give me the same result... [19:48:23] I wonder what would happen if I kill the php processes and let scap continue? [19:48:41] ShakespeareFan00: If you're curious, https://wikitech.wikimedia.org/wiki/Deployments, https://wikitech.wikimedia.org/wiki/How_to_deploy_code [19:49:07] RECOVERY - MegaRAID on db1073 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [19:49:37] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1970 bytes in 0.104 second response time [19:50:16] hrm, you'd probably just sync out whatever is in php-[versions]/cache/l10 already [19:50:31] Which is...probably fine for now? [19:50:57] scap makes a tmp dir and moves it /tmp/scap_l10n_1097381672 in this case [19:51:26] it would be interesting to see what all those processes are doing [19:51:54] It would indeed [19:51:58] Oh wait [19:52:01] Finished [19:52:04] generating md5 files now [19:52:14] 19:10:34 Updating ExtensionMessages-1.31.0-wmf.25.php [19:52:14] 19:10:40 Updating LocalisationCache for 1.31.0-wmf.25 using 10 thread(s) [19:52:14] 19:51:43 Generating JSON versions and md5 files [19:52:14] 19:52:01 Finished l10n-update (duration: 42m 33s) [19:52:19] Moving on to sync-masters now [19:52:22] Well *that* was weird. [19:52:30] thcipriani: last log on the scap dashboard on logstash is 19:10 UTC, scap restarted 19:06 [19:52:32] We shouldn't have had to rebuild all of l10n.... [19:52:51] that's bizarre. Normally that takes like 5-10 minutes. 
[19:54:11] Still don't see the new i18n messages on prod [19:54:13] https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail¬ice=WMIL+-+Writing+contest+-+electricity [19:54:32] they're still syncing [19:54:39] Ah ok [19:55:46] (in case it's useful: see the last checkbox in the "Extra campaign features" area, where it still says "centralnotice-impression-events-sample-rate" instead of the actual message) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180314T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:22] Nothing for ORES. [20:00:42] Postponing the services deploy anyway, train's running a little late [20:04:07] (03PS1) 10Chad: Revert "Stop forcing php5 in `mwscript`" [puppet] - 10https://gerrit.wikimedia.org/r/419515 [20:04:13] thcipriani: ^^ [20:09:45] no_justification: can you set PHP=php5 in the environment for that call? [20:10:06] I could maybe [20:10:20] That'd allow me to test the theory before merging ^^^ [20:10:23] yeah [20:10:46] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 50.73, 25.46, 18.14 [20:11:01] also, we've got to change it somehow since a php5 binary doesn't exist in stretch [20:11:26] Easy fix: add a symlink in the puppet manifest :) [20:11:27] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 63.59, 33.52, 24.69 [20:11:28] the other thing that could be done is to stop setting hhvm as the active php alternate [20:11:40] I think this was the last blocker to that anyway [20:11:46] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 29.48, 23.92, 18.06 [20:12:11] no_justification thcipriani assuming the scap problem is indeed not related to the CN update.... 
I have be AFK for a while (kid pickup)... If by any odd twist of fate there were some issue related to CN, pls ping ejegg, jump onto #wikimedia-fundraising, or even call..... thx for the help & good luck! [20:12:27] It's not CN, for sure :) [20:12:32] Just sorry it took so long [20:13:22] CN code seems to have gotten out just fine [20:13:27] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 27.13, 30.95, 24.93 [20:13:39] Yeah, code's live [20:13:44] Even before the l10n stuff finished, I was able to enable the new EventLogging impressions for the test campaign [20:13:44] It's just wrapping stuff up now [20:13:59] and I saw the /beacon/event call on aabooks [20:14:07] no_justification: no worries, thanks so much!!!! [20:15:19] This is the first hour+ scap I've had in ages. [20:15:23] I do *not* miss these days [20:15:42] no_justification: heh. used to be so normal though. things do get better! [20:15:59] my first branch deploy took about 5 hours :) [20:16:14] !log demon@tin Finished scap: scapping, pt. 2. prior one failed because i tested something (duration: 69m 43s) [20:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:22] !log demon@tin Started scap: trying a php5/hhvm theory [20:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:56] hmm 69 min [20:19:36] sadness [20:19:50] <_joe_> bd808: we should *NEVER* use php5 anymore [20:19:54] <_joe_> where is that needed? [20:20:31] _joe_: running cli jobs where hhvm is significantly slower [20:20:54] in this case the MediaWiki l10n data build [20:20:58] rebuilding l10n stage of scap specifically [20:21:02] <_joe_> bd808: uhm, so by april 9, we can think of replacing terbium [20:21:03] Took >42mins [20:21:08] This is tin [20:21:10] <_joe_> no_justification: oh that's hhvm? [20:21:13] Which is caught in hardware hell [20:21:20] <_joe_> that's that slow? 
[20:21:23] Yes [20:21:27] It should be like ~6-10mins [20:21:37] tops [20:21:42] <_joe_> anyone tried to debug it? [20:21:44] it always has been, that's why we pinned mwscript to php5 [20:21:45] !log mholloway-shell@tin Started deploy [mobileapps/deploy@0f9625a]: Update mobileapps to 9f4a80c [20:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:55] _joe_: Well, the theory is the php5 -> hhvm swap [20:21:55] <_joe_> bd808: nope [20:21:57] Which I'm testing now [20:22:07] _joe_: I suspect one issue is stat cache [20:22:09] <_joe_> we pinned it to php5 because some scripts would fail [20:22:12] 20:22:06 Finished l10n-update (duration: 02m 37s) [20:22:20] Yeah I'm pretty convinced ^ [20:22:25] <_joe_> yeah, I'm telling you to debug the hhvm run [20:22:25] the l10nupdate process calls stat about 11 jillion times [20:22:42] <_joe_> there might be some very very low-hanging fruit we could use to speed that up [20:22:58] Help welcome :) [20:22:59] _joe_: like any zend interpreter :) [20:24:00] !log demon@tin Finished scap: trying a php5/hhvm theory (duration: 06m 37s) [20:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:23] _joe_: I can't remember if ori and I profiled this ... 3years ago... or not [20:24:23] bd808: That was running it with PHP=php5 [20:24:54] no_justification: that's a fairly easy work around then I guess. [20:25:54] Remembering to do it? [20:26:32] heh. I was thinking putting it into scap directly [20:26:48] could put it in the environment in a scap run like we do with SSH_AUTH_SOCK [20:27:22] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@0f9625a]: Update mobileapps to 9f4a80c (duration: 05m 37s) [20:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:55] at least over the short term, not sure what the timeline is for that workaround to fall on its face. Seems nigh. [20:28:27] It'll break as soon as we have a non-php5 host. 
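
Editor's note on the workaround discussed above: the l10n rebuild took 42m 33s under hhvm and 02m 37s when run with PHP=php5, and the idea floated was to set that variable in the environment of the scap call. Since the log also notes that a php5 binary does not exist on stretch, a hedged sketch of "prefer php5, fall back to another interpreter" might look like this. The function name, the candidate list and the assumption that the wrapper script honours a `PHP` environment variable are inferred from the conversation, not taken from scap or mwscript source.

```python
import os
import shutil


def interpreter_env(base_env=None, candidates=("php5", "php7.0"), which=shutil.which):
    """Return a copy of the environment with PHP set to the first binary found.

    `candidates` is illustrative: php5 is what the log tested, php7.0 is what
    one participant exports on a stretch host. If none is found, the
    environment is returned unchanged so the caller's default applies.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in candidates:
        if which(name):
            env["PHP"] = name
            break
    return env


if __name__ == "__main__":
    env = interpreter_env()
    # The result would be passed as env=... to subprocess.run() for the
    # l10n rebuild call; here we only show which interpreter was chosen.
    print(env.get("PHP", "(PHP not set; no candidate found on PATH)"))
```

The fallback keeps the workaround from breaking on the first host without php5, which is the failure mode raised at the end of the discussion.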
[20:28:29] :\ [20:29:14] (03CR) 10Chad: [C: 032] Group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419481 (owner: 10Chad) [20:29:53] no_justification: we have two! Wikitech is on stretch [20:30:13] we jumped from being weird by being old to being weird by being new [20:30:29] (03Merged) 10jenkins-bot: Group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419481 (owner: 10Chad) [20:30:31] Well, soon as we run a mwscript there... ;-) [20:31:23] (03PS1) 10Madhuvishy: dumps: Setup web server config in distribution hosts [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) [20:31:37] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.076 second response time [20:31:49] no_justification: I have `export PHP=php7.0` in my .bashrc there ;) [20:32:01] !log demon@tin Synchronized php: symlink bump to wmf.25 (duration: 01m 14s) [20:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:34] !log demon@tin rebuilt and synchronized wikiversions files: group1 to wmf.25 [20:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:17] (03CR) 10Andrew Bogott: [C: 032] ":(" [puppet] - 10https://gerrit.wikimedia.org/r/419515 (owner: 10Chad) [20:35:23] (03PS2) 10Andrew Bogott: Revert "Stop forcing php5 in `mwscript`" [puppet] - 10https://gerrit.wikimedia.org/r/419515 (owner: 10Chad) [20:38:28] 10Operations: Build .deb package of python3-aiokafka - https://phabricator.wikimedia.org/T189741#4051552 (10Imarlier) [20:43:49] (03PS1) 10RobH: updating labvirt102[12] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/419524 (https://phabricator.wikimedia.org/T183937) [20:44:01] (03PS2) 10RobH: updating labvirt102[12] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/419524 (https://phabricator.wikimedia.org/T183937) [20:44:18] 
(03CR) 10RobH: [C: 032] updating labvirt102[12] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/419524 (https://phabricator.wikimedia.org/T183937) (owner: 10RobH) [20:45:39] (03PS2) 10Madhuvishy: dumps: Setup web server config in distribution hosts [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) [20:48:48] (03PS3) 10Andrew Bogott: openstack: dns-floating-ip-updater.py [puppet] - 10https://gerrit.wikimedia.org/r/419336 (owner: 10BryanDavis) [20:49:19] (03CR) 10Andrew Bogott: [C: 032] openstack: dns-floating-ip-updater.py [puppet] - 10https://gerrit.wikimedia.org/r/419336 (owner: 10BryanDavis) [20:50:15] (03PS3) 10Madhuvishy: dumps: Setup web server config in distribution hosts [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) [20:50:35] (03CR) 10jenkins-bot: Group1 to wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419481 (owner: 10Chad) [20:52:07] (03PS5) 10Andrew Bogott: labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - 10https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470) [20:52:10] (03CR) 10ArielGlenn: "Do you deal with ipv6 service anywhere? You'll want it, on the current web servers we do" [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) (owner: 10Madhuvishy) [20:53:00] (03CR) 10Andrew Bogott: [C: 032] labtestweb: refactor to more closely resemble the labweb* deploy [puppet] - 10https://gerrit.wikimedia.org/r/419180 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:53:59] (03CR) 10Madhuvishy: "Right, yeah I was going to investigate that in the next step!" 
[puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) (owner: 10Madhuvishy) [20:55:01] (03PS4) 10Madhuvishy: dumps: Setup web server config in distribution hosts [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) [20:56:49] (03PS1) 10Esanders: Enable wgCiteResponsiveReferences on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419528 (https://phabricator.wikimedia.org/T189658) [21:01:52] (03PS5) 10Madhuvishy: dumps: Setup web server config in distribution hosts [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) [21:02:46] (03PS1) 10Andrew Bogott: labtestweb2001: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/419532 [21:03:47] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4051695 (10faidon) 05Open>03Resolved These are now attached and configured, resolving. [21:04:10] (03CR) 10Andrew Bogott: [C: 032] labtestweb2001: move to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/419532 (owner: 10Andrew Bogott) [21:08:05] (03PS6) 10Madhuvishy: dumps: Setup web server config in distribution hosts [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) [21:08:22] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4051709 (10RobH) Ok, the new network cards are installed. They have the first 2 ports as 10G, and the second 2 as 1G. When I go to boot labvirt1021, it PXE boots, and hits the inst... 
[21:11:37] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1943 bytes in 0.082 second response time [21:11:43] (03PS7) 10Madhuvishy: dumps: Setup web server config in distribution hosts [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) [21:12:29] (03CR) 10Madhuvishy: [C: 032] dumps: Setup web server config in distribution hosts [puppet] - 10https://gerrit.wikimedia.org/r/419522 (https://phabricator.wikimedia.org/T188641) (owner: 10Madhuvishy) [21:15:56] (03PS1) 10Andrew Bogott: californium: mark as spare system [puppet] - 10https://gerrit.wikimedia.org/r/419534 (https://phabricator.wikimedia.org/T168470) [21:16:28] 10Operations, 10Cloud-Services, 10Cloud-VPS, 10Traffic: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#4051724 (10Andrew) 05Open>03declined This is moot, californium is moving into the spare pool. [21:17:04] (03PS2) 10Andrew Bogott: californium: mark as spare system [puppet] - 10https://gerrit.wikimedia.org/r/419534 (https://phabricator.wikimedia.org/T168470) [21:17:55] 10Operations: Build .deb package of python3-typing for jessie - https://phabricator.wikimedia.org/T189729#4051741 (10Imarlier) If it helps, this repo just needs `dpkg-buildpackage` run at the root in order to generate the deb: https://github.com/marlier/python-typing [21:26:07] !log rebuilding labtestweb2001 with Debian Stretch [21:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:23] (03PS1) 10Madhuvishy: dumps: Set xmldumps server as localhost for testing [puppet] - 10https://gerrit.wikimedia.org/r/419590 (https://phabricator.wikimedia.org/T188641) [21:27:43] !log Ran scap pull on mwdebug1001 after testing https://gerrit.wikimedia.org/r/417180 [21:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:54] (03CR) 10Madhuvishy: [C: 032] dumps: Set xmldumps server as localhost 
for testing [puppet] - 10https://gerrit.wikimedia.org/r/419590 (https://phabricator.wikimedia.org/T188641) (owner: 10Madhuvishy) [21:38:32] (03CR) 10Bstorm: [C: 032] toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/418709 (https://phabricator.wikimedia.org/T188680) (owner: 10BryanDavis) [21:38:45] (03PS3) 10Bstorm: toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/418709 (https://phabricator.wikimedia.org/T188680) (owner: 10BryanDavis) [21:38:58] 10Operations, 10Cloud-Services, 10Cloud-VPS, 10Traffic: Move californium to an internal host? - https://phabricator.wikimedia.org/T133149#2223542 (10bd808) It's slightly off topic with californium being decommed, but things that are allocated for Cloud Services infrastructure are generally being moved into... [21:42:07] (03PS1) 10Rush: openstack: neutron file permissions adjustments [puppet] - 10https://gerrit.wikimedia.org/r/419593 (https://phabricator.wikimedia.org/T188266) [21:42:19] (03PS2) 10Rush: openstack: neutron file permissions adjustments [puppet] - 10https://gerrit.wikimedia.org/r/419593 (https://phabricator.wikimedia.org/T188266) [21:42:21] (03PS2) 10Dzahn: base/icinga: add Hiera override to skip systemd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) [21:42:58] (03PS3) 10Rush: openstack: neutron file permissions adjustments [puppet] - 10https://gerrit.wikimedia.org/r/419593 (https://phabricator.wikimedia.org/T188266) [21:43:11] (03CR) 10jerkins-bot: [V: 04-1] base/icinga: add Hiera override to skip systemd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn) [21:43:38] (03Draft1) 10Ahmed123: Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) [21:43:51] (03CR) 10Rush: [C: 032] 
openstack: neutron file permissions adjustments [puppet] - 10https://gerrit.wikimedia.org/r/419593 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:46:53] (03PS1) 10RobH: setting labvirt102[12] to tftp load (not http) [puppet] - 10https://gerrit.wikimedia.org/r/419594 (https://phabricator.wikimedia.org/T183937) [21:47:48] (03CR) 10RobH: [C: 032] setting labvirt102[12] to tftp load (not http) [puppet] - 10https://gerrit.wikimedia.org/r/419594 (https://phabricator.wikimedia.org/T183937) (owner: 10RobH) [21:54:45] (03PS2) 10Ahmed123: Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) [22:02:03] (03CR) 10Zoranzoki21: [C: 031] Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) (owner: 10Ahmed123) [22:02:45] (03PS1) 10Chico Venancio: Toolforge:puppet-apt-pinning upgrading kubernetes version for PAWS [puppet] - 10https://gerrit.wikimedia.org/r/419599 (https://phabricator.wikimedia.org/T189680) [22:03:33] (03PS1) 10Bstorm: Revert "toolsdb: Remove stale accounts if present in maintain-dbusers" [puppet] - 10https://gerrit.wikimedia.org/r/419600 [22:05:04] (03PS2) 10Bstorm: Revert "toolsdb: Remove stale accounts if present in maintain-dbusers" [puppet] - 10https://gerrit.wikimedia.org/r/419600 [22:05:49] (03CR) 10Bstorm: [C: 032] Revert "toolsdb: Remove stale accounts if present in maintain-dbusers" [puppet] - 10https://gerrit.wikimedia.org/r/419600 (owner: 10Bstorm) [22:06:00] (03PS1) 10RobH: fixing entries in dhcp lease file [puppet] - 10https://gerrit.wikimedia.org/r/419602 [22:06:19] (03PS2) 10RobH: fixing entries in dhcp lease file [puppet] - 10https://gerrit.wikimedia.org/r/419602 [22:06:30] (03CR) 10RobH: [C: 032] fixing entries in dhcp lease file [puppet] - 10https://gerrit.wikimedia.org/r/419602 (owner: 10RobH) [22:10:04] (03CR) 10MarcoAurelio: [C: 04-1] 
"Missing entry at $wgRemoveGroups so the permission can be removed locally as well." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) (owner: 10Ahmed123) [22:13:06] !log reedy@tin Synchronized php-1.31.0-wmf.25/extensions/Thanks: T189752 (duration: 01m 16s) [22:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:13] T189752: ThanksHooks::insertThankLink: Call to a member function getUser() on a non-object (null) - https://phabricator.wikimedia.org/T189752 [22:18:49] (03PS3) 10Ahmed123: Enable rollbacker user right at arwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) [22:19:28] (03CR) 10Zoranzoki21: [C: 031] "> Missing entry at $wgRemoveGroups so the permission can be removed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419535 (https://phabricator.wikimedia.org/T189732) (owner: 10Ahmed123) [22:26:19] (03PS3) 10Dzahn: base/icinga: add Hiera override to skip systemd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) [22:27:01] (03CR) 10jerkins-bot: [V: 04-1] base/icinga: add Hiera override to skip systemd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: 10Dzahn) [22:29:31] (03PS4) 10Dzahn: base/icinga: add Hiera override to skip systemd monitoring [puppet] - 10https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) [22:31:22] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4051914 (10RobH) IRC Update: After chatting with @faidon it was determined this was failing due to the recent change to serve kernel images via http but the labs ACLs don't allow that... 
[22:35:15] (03PS1) 10RobH: setting new labvirts to spare role [puppet] - 10https://gerrit.wikimedia.org/r/419609 (https://phabricator.wikimedia.org/T183937) [22:35:50] (03PS2) 10RobH: setting new labvirts to spare role [puppet] - 10https://gerrit.wikimedia.org/r/419609 (https://phabricator.wikimedia.org/T183937) [22:35:54] (03CR) 10RobH: [C: 032] setting new labvirts to spare role [puppet] - 10https://gerrit.wikimedia.org/r/419609 (https://phabricator.wikimedia.org/T183937) (owner: 10RobH) [22:38:34] 10Operations, 10ops-eqiad, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4051948 (10RobH) [22:39:57] 10Operations, 10ops-eqiad, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3867914 (10RobH) a:05Cmjohnson>03chasemp Ok, escalating this to @chasemp for completion. The systems are installed and calling into puppet. Their 1G ports are showing as eth2/... [22:40:08] 10Operations, 10cloud-services-team: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4051958 (10RobH) [22:41:15] (03PS1) 10Subramanya Sastry: Enable RemexHtml on all wikiversity wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419611 (https://phabricator.wikimedia.org/T188880) [22:41:28] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=55%) [22:42:18] Reedy: when creating new wikis which was first again, apache config or create_wiki [22:42:56] did the ServerAlias wait for another step [22:44:08] ok, i see on "Add a wiki" [22:45:04] (03PS2) 10Dzahn: Add hi.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/417200 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [22:50:29] (03PS3) 10Dzahn: mediawiki/apache: Add hi.wikimedia.org ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/417200 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [22:50:52] (03CR) 10Dzahn: [C: 032] 
mediawiki/apache: Add hi.wikimedia.org ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/417200 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [22:53:20] 10Operations, 10ops-eqiad, 10netops: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4051974 (10ayounsi) [22:54:48] (03CR) 10Dzahn: [C: 032] "confirmed on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/417200 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180314T2300). [23:00:04] tgr, awight, and MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:19] o/ [23:00:21] :) [23:00:37] hi [23:01:14] I have two things for SWAT, hope I'm not too late [23:01:20] they'll be quick [23:01:35] (i'll be back in a minute, i forgot about the DST and thought it would happen an hour from now) [23:03:53] no one deploying? :( [23:04:20] why is everyone confused about DST today? :P [23:04:35] SWAT suffers from bystander effect [23:04:47] I can deploy I guess [23:06:06] Platonides: it's different in EU and in US [23:06:10] during just this one week in the year [23:06:17] (or perhaps two weeks?) [23:06:26] did US move to DST already? [23:06:41] there's still two weeks for EU [23:06:57] * Platonides is looking for US rules [23:07:16] tgr: I'm around if you want help [23:07:23] (or too busy/ etc.) [23:07:25] ok, found and yes [23:07:25] Platonides: yes. 
all of my meetings and all of the deployments shifted by one hour for me ;) [23:07:32] https://en.wikipedia.org/wiki/Daylight_saving_time_in_the_United_States [23:08:00] MatmaRex: can those patches go out in one batch? [23:08:05] set all meetings on UTC :P [23:08:16] tgr: yes [23:08:23] tgr: well, two batches, it's on two branches [23:08:36] Amir1: are your patches on the Deployments page? [23:08:44] or do you want to deploy them yourself? [23:08:48] I'm adding them [23:08:52] I can deploy them [23:09:07] (03PS2) 10Gergő Tisza: Update ORES threshold config to the new syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419499 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [23:09:19] this self-service SWAT [23:10:48] (03PS1) 10Andrew Bogott: labtestweb2001: change raid config to resemble labweb1001 [puppet] - 10https://gerrit.wikimedia.org/r/419620 [23:11:37] 10Operations, 10netops: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4051996 (10ayounsi) First, the questions and troubleshooting commands listed on https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue are useful to us. Espec... 
[23:12:13] (03CR) 10Andrew Bogott: [C: 032] labtestweb2001: change raid config to resemble labweb1001 [puppet] - 10https://gerrit.wikimedia.org/r/419620 (owner: 10Andrew Bogott) [23:12:20] 10Operations, 10netops: Connection timeout from 195.77.175.64/29 to text-lb.esams.wikimedia.org - https://phabricator.wikimedia.org/T189689#4051997 (10ayounsi) a:03Samtar [23:12:22] (03CR) 10Gergő Tisza: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419499 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [23:13:33] (03Merged) 10jenkins-bot: Update ORES threshold config to the new syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419499 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [23:14:39] tgr: I’m wondering about how to test my config change on the canary: I have to clear a few getMainWANObjectCache keys, but can that be isolated to just mwdebug? [23:14:57] and thank you for swatting! [23:15:05] awight: I don't think so [23:15:16] is the object change backwards compatible? 
[23:15:28] yes, the cache entries are compatible [23:15:49] I’m happy with skipping the canary test, I’ve smoke tested on beta [23:15:49] it's on mwdebug1002 btw [23:16:35] ok [23:16:51] tgr: ok, I’m doing a sanity test hitting mwdebug, one moment please [23:17:31] tgr: rcfilters look good on mwdebug1002 [23:17:37] thx [23:18:20] Once it’s live, I’ll clear the cache entry for sqwiki to test the new config [23:18:21] (03PS2) 10Gergő Tisza: Enable Wikidata description override on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000) [23:18:53] (03CR) 10jenkins-bot: Update ORES threshold config to the new syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419499 (https://phabricator.wikimedia.org/T181159) (owner: 10Awight) [23:18:57] !log tgr@tin Synchronized wmf-config/InitialiseSettings-labs.php: T181159 Update ORES threshold config to the new syntax (duration: 01m 15s) [23:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:04] T181159: Migrate ORES extension threshold config from old to new syntax - https://phabricator.wikimedia.org/T181159 [23:20:30] awight: should we start dropping the old syntax part? [23:20:40] Amir1: not in code, quite yet. [23:20:45] Maybe in a day :) [23:20:53] \o/ [23:21:02] (03PS1) 10Andrew Bogott: labtestweb: fuly/qualify/path to partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/419623 [23:21:15] (03CR) 10Gergő Tisza: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [23:21:19] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: T181159 Update ORES threshold config to the new syntax (duration: 01m 20s) [23:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:30] awight: it's live [23:21:37] ty, testing... 
[23:21:54] (03CR) 10Andrew Bogott: [C: 032] labtestweb: fuly/qualify/path to partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/419623 (owner: 10Andrew Bogott) [23:22:30] tgr: https://gerrit.wikimedia.org/r/#/c/419501/ failed because of broken Thanks tests, you'll have to +2 it again [23:22:33] (03Merged) 10jenkins-bot: Enable Wikidata description override on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [23:23:35] tgr: lgtm! [23:24:24] (03CR) 10jenkins-bot: Enable Wikidata description override on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [23:27:18] duh, I messed up that Wikibase config change [23:28:41] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4052028 (10RobH) [23:28:56] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#4052029 (10ayounsi) a:05ayounsi>03Jgreen Switch port configured. [23:29:54] tgr: Eh, wmgWikibaseAllowLocalShortDesc is not set in prod, right? [23:30:16] Seems like that'd E_notice [23:30:21] yeah, I haven't deployed it yet [23:30:24] kk [23:31:44] (03PS1) 10RobH: db2012 decom [puppet] - 10https://gerrit.wikimedia.org/r/419628 (https://phabricator.wikimedia.org/T187543) [23:32:12] (03CR) 10RobH: [C: 032] db2012 decom [puppet] - 10https://gerrit.wikimedia.org/r/419628 (https://phabricator.wikimedia.org/T187543) (owner: 10RobH) [23:32:20] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#4052040 (10ayounsi) a:05ayounsi>03Jgreen Switch port configured. 
[23:32:26] (03PS1) 10Gergő Tisza: Fix I6f31e91ed4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419629 [23:32:43] (03PS1) 10Bstorm: toolsdb: Remove stale accounts if present in maintain-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/419630 (https://phabricator.wikimedia.org/T188680) [23:34:37] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#4052047 (10ayounsi) a:05ayounsi>03Jgreen Switch port configured. [23:35:01] (03CR) 10Gergő Tisza: [C: 032] "SWAT hotfix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419629 (owner: 10Gergő Tisza) [23:36:14] (03Merged) 10jenkins-bot: Fix I6f31e91ed4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419629 (owner: 10Gergő Tisza) [23:36:29] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#4052049 (10ayounsi) a:05ayounsi>03Jgreen Switch ports configured. [23:37:03] (03PS1) 10RobH: db2012 decom, removing prod dns [dns] - 10https://gerrit.wikimedia.org/r/419631 (https://phabricator.wikimedia.org/T187543) [23:37:39] (03CR) 10RobH: [C: 032] db2012 decom, removing prod dns [dns] - 10https://gerrit.wikimedia.org/r/419631 (https://phabricator.wikimedia.org/T187543) (owner: 10RobH) [23:38:42] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#4052057 (10RobH) [23:38:54] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3978461 (10RobH) a:05RobH>03Papaul Ok, this is now ready for onsite wipe. 
[23:38:56] (03CR) 10jenkins-bot: Fix I6f31e91ed4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419629 (owner: 10Gergő Tisza) [23:41:02] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: T184000 Enable Wikidata description override on beta cluster (duration: 01m 15s) [23:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:09] T184000: Magic word on English WP to override display of Wikidata short description - https://phabricator.wikimedia.org/T184000 [23:43:18] andrewbogott: scap is yelling at me because /etc/ssh/ssh_known_hosts is out of date with the labtestweb2001 SSH host key [23:43:25] !log tgr@tin Synchronized wmf-config/InitialiseSettings-labs.php: T184000 Enable Wikidata description override on beta cluster (duration: 01m 15s) [23:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:34] is that just some kind of puppet delay or does it need to be fixed manually? [23:43:37] tgr: yeah, reedy pinged me about that too but I don't know how to reset it [23:43:41] It's a host I'm re-imaging [23:43:50] and I'm about to do it again so best to hold off :) [23:44:20] Does it actually break the deploy or just complain? 
[23:44:31] only to that host [23:44:32] ok [23:44:43] you might want to do a scap pull when it's fixed [23:44:44] In theory puppet will update all the known host keys but it takes a couple of cycles [23:44:52] so let's see if it still happens next time [23:45:26] !log tgr@tin Synchronized wmf-config/Wikibase.php: T184000 Enable Wikidata description override on beta cluster (duration: 01m 14s) [23:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:37] (03PS5) 10Dzahn: mediawiki/apache: Add romd.wikimedia.org ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/412898 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [23:50:01] MatmaRex: on mwdebug1002 (both branches) [23:50:32] tgr: thanks, give me a moment to test it all [23:51:45] ack [23:52:09] Amir1: there is no special trick to Wikibase deploys, is there? [23:52:32] tgr: nope but they both need to be deployed together [23:53:11] core first, by the look of it? [23:55:27] tgr: yup [23:58:19] tgr: everything seems fine!