[20:24:51] (03CR) 10Filippo Giunchedi: [C: 032] logstash: temp stop managing indices [puppet] - 10https://gerrit.wikimedia.org/r/472247 (owner: 10Filippo Giunchedi) [20:24:52] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: rollback labswiki to 1.33.0-wmf.2 [20:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:55] greg-g: ^ done, problem gone? [20:26:44] (03PS1) 10Thcipriani: labswiki rollback to 1.33.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472250 [20:27:06] (03CR) 10Anomie: [C: 031] Allow Cloud VPS 172.16.0.0/16 for $wmgAllowLabsAnonEdits wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472243 (https://phabricator.wikimedia.org/T208986) (owner: 10BryanDavis) [20:28:48] Oh oops, I never !log-ed that I ran namespaceDupes.php. Next time I'll remember. [20:29:21] (03CR) 10Thcipriani: [C: 032] labswiki rollback to 1.33.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472250 (owner: 10Thcipriani) [20:29:59] (03PS1) 10Effie Mouzeli: Change rdb1005 to spare:system [puppet] - 10https://gerrit.wikimedia.org/r/472251 (https://phabricator.wikimedia.org/T206450) [20:30:34] (03Merged) 10jenkins-bot: labswiki rollback to 1.33.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472250 (owner: 10Thcipriani) [20:31:53] thcipriani: Got a couple of error fixes prepared whenever it's good – https://gerrit.wikimedia.org/r/#/q/status:open+branch:wmf/1.33.0-wmf.3 [20:34:18] Krinkle: train rollout seems stable now [20:36:14] OK [20:40:02] (03CR) 10jenkins-bot: labswiki rollback to 1.33.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472250 (owner: 10Thcipriani) [20:40:40] thcipriani: sorry, yes [20:41:40] I deduced from the error logs :) [20:43:02] thcipriani: humans are fallible [20:43:40] (03PS1) 10Andrew Bogott: Nova: add cloudvirt1017 to the scheduler pool [puppet] - 10https://gerrit.wikimedia.org/r/472253 (https://phabricator.wikimedia.org/T208733) 
[20:53:09] (03PS2) 10Banyek: wiki replicas: depool lasbdb1010 for view changes [puppet] - 10https://gerrit.wikimedia.org/r/471295 (https://phabricator.wikimedia.org/T189158) (owner: 10Bstorm) [20:53:15] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool lasbdb1010 for view changes [puppet] - 10https://gerrit.wikimedia.org/r/471295 (https://phabricator.wikimedia.org/T189158) (owner: 10Bstorm) [20:55:04] !log depool labsdb1010 (T189158) [20:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:06] T189158: Change `image` view to properly expose the new `img_description_id` field - https://phabricator.wikimedia.org/T189158 [20:56:21] I am getting a fatal error " „UnexpectedValueException“ for https://commons.wikimedia.org/w/index.php?title=Category:Argenta_(company)&action=edit [20:59:21] thcipriani ^^ [20:59:41] Raymond_: disable TwoColConflict for now [20:59:52] MaxSem: thanks [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181107T2100). [21:00:24] yep. works now :) [21:00:45] I'll submit the bug [21:01:03] * Krinkle stating on mwdebug1002 [21:01:09] (03PS2) 10Bstorm: sonofgridengine: remove ldapconfig materials [puppet] - 10https://gerrit.wikimedia.org/r/472241 (https://phabricator.wikimedia.org/T200557) [21:01:18] I deploy something for ores [21:01:41] Amir1: mw? 
[21:01:43] (03PS2) 10Cwhite: diamond: ensure Nginx collector absent [puppet] - 10https://gerrit.wikimedia.org/r/471360 (https://phabricator.wikimedia.org/T183454) [21:01:57] nope, ores It's services deploy window [21:01:58] (03PS3) 10Cwhite: diamond: ensure Nginx collector absent [puppet] - 10https://gerrit.wikimedia.org/r/471360 (https://phabricator.wikimedia.org/T183454) [21:02:01] OK [21:02:58] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28734 MB (5% inode=99%) [21:03:01] (03CR) 10Bstorm: [C: 032] sonofgridengine: remove ldapconfig materials [puppet] - 10https://gerrit.wikimedia.org/r/472241 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [21:03:21] MaxSem: I see this error in logstash, is this something that I should be rolling back for? What is the impact? [21:03:28] !log ladsgroup@deploy1001 Started deploy [ores/deploy@25dfa4f]: T191842 T197096 [21:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:33] T191842: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842 [21:03:34] T197096: [Epic] Use LFS for large ORES files - https://phabricator.wikimedia.org/T197096 [21:03:59] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.3/includes/jobqueue/jobs/RefreshLinksJob.php: T208147 -I7f5fafe9439d8a7b4 (duration: 00m 54s) [21:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:02] T208147: PHP Fatal Error from RefreshLinksJob: Argument to runForTitle() must be Title - https://phabricator.wikimedia.org/T208147 [21:04:06] (03CR) 10Cwhite: [C: 032] diamond: ensure Nginx collector absent [puppet] - 10https://gerrit.wikimedia.org/r/471360 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [21:04:11] thcipriani: I believe the fix is already up for SWAT in a few hours, not a new error. 
[21:04:23] (03PS4) 10Cwhite: diamond: ensure Nginx collector absent [puppet] - 10https://gerrit.wikimedia.org/r/471360 (https://phabricator.wikimedia.org/T183454) [21:04:32] thcipriani: 48 errors over the last 1 hour [21:04:37] Krinkle: ah, ok. I hadn't seen it before afaicr. [21:04:52] Might be a different error from TwoColConflict, that's possible [21:05:09] https://phabricator.wikimedia.org/T205942 [21:06:08] !log stopping replication on db2072 (T208954) [21:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:12] T208954: Missing row in enwiki.archive on sanitarium - https://phabricator.wikimedia.org/T208954 [21:06:49] Canary is happy, moving to all [21:06:58] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/AbuseFilter/includes/AbuseFilter.php: T208144 - I0fdda51010243 (duration: 00m 53s) [21:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:01] T208144: Fatal error on file upload: "Argument to AbuseFilter::filterAction() must be Title, null given" - https://phabricator.wikimedia.org/T208144 [21:11:48] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:16:48] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/VipsScaler: Id9f82afd (duration: 00m 55s) [21:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:56] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.2/extensions/AbuseFilter/includes/AbuseFilter.php: T208144 - I0fdda510102436 (duration: 00m 53s) [21:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:02] T208144: Fatal error on file upload: "Argument to AbuseFilter::filterAction() must be Title, null given" - https://phabricator.wikimedia.org/T208144 [21:20:52] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@25dfa4f]: T191842 T197096 (duration: 17m 24s) [21:21:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:02] T191842: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842 [21:21:03] T197096: [Epic] Use LFS for large ORES files - https://phabricator.wikimedia.org/T197096 [21:21:49] RECOVERY - Disk space on elastic1025 is OK: DISK OK [21:23:04] 10Operations, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review, and 2 others: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842 (10Ladsgroup) Now deployment time has been reduced from 22 minutes to 17 minutes. I will increase the number of parallel... [21:27:38] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.539 second response time [21:31:08] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:52] thcipriani: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LdapAuthentication/+/472336 should allow the train to go back to wikitechwiki [21:45:14] !log arlolra@deploy1001 Started deploy [parsoid/deploy@4edc771]: Updating Parsoid to 970751a [21:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:38] James_F: thanks for backporting, I'll go ahead and push that out now and try to re-roll-forward [21:51:19] PROBLEM - IPsec on rdb2005 is CRITICAL: Strongswan CRITICAL - ok: 1 not-conn: rdb1005_v4 [21:52:00] Yay. 
[21:52:07] (03CR) 10Cwhite: [C: 032] add socket_bufsize option to make SO_RCVBUF tunable [debs/statsd-proxy] (wmf_v0.0.10) - 10https://gerrit.wikimedia.org/r/470512 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [21:52:29] (03CR) 10Cwhite: [V: 032 C: 032] add socket_bufsize option to make SO_RCVBUF tunable [debs/statsd-proxy] (wmf_v0.0.10) - 10https://gerrit.wikimedia.org/r/470512 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [21:52:40] thcipriani: There's also https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/472243/ to fix the Cloud VPS issue. [21:52:58] I said I'd do it once the train was done, but I'm happy to leave it to you. ;-) [21:54:31] oh good :) [21:54:35] sure, I'll get it [21:54:48] PROBLEM - Check health of redis instance on 6378 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 628 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 6 keys, up 127 days 19 hours - replication_delay is 628 [21:54:48] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@4edc771]: Updating Parsoid to 970751a (duration: 09m 34s) [21:54:49] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3178 keys, up 127 days 19 hours - replication_delay is 631 [21:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:59] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 645 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2888 keys, up 127 days 19 hours - replication_delay is 645 [21:55:15] (03CR) 10Thcipriani: [C: 032] Allow Cloud VPS 172.16.0.0/16 for $wmgAllowLabsAnonEdits wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472243 (https://phabricator.wikimedia.org/T208986) (owner: 10BryanDavis) [21:55:19] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: 
replication_delay is 665 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3088 keys, up 127 days 19 hours - replication_delay is 665 [21:55:19] ACKNOWLEDGEMENT - IPsec on rdb2005 is CRITICAL: Strongswan CRITICAL - ok: 1 not-conn: rdb1005_v4 Effie Mouzeli T206450: rdb1005 is being reimaged [21:56:32] (03Merged) 10jenkins-bot: Allow Cloud VPS 172.16.0.0/16 for $wmgAllowLabsAnonEdits wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472243 (https://phabricator.wikimedia.org/T208986) (owner: 10BryanDavis) [21:58:09] ACKNOWLEDGEMENT - Check health of redis instance on 6378 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 830 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 6 keys, up 127 days 19 hours - replication_delay is 830 Effie Mouzeli T206450: rdb1005 is being reimaged [21:58:09] ACKNOWLEDGEMENT - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 801 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3088 keys, up 127 days 19 hours - replication_delay is 801 Effie Mouzeli T206450: rdb1005 is being reimaged [21:58:09] ACKNOWLEDGEMENT - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 780 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2888 keys, up 127 days 19 hours - replication_delay is 780 Effie Mouzeli T206450: rdb1005 is being reimaged [21:58:09] ACKNOWLEDGEMENT - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 834 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3178 keys, up 127 days 19 hours - replication_delay is 834 Effie Mouzeli T206450: rdb1005 is being reimaged [22:02:07] !log Updated Parsoid to 970751a (T206940) [22:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:10] T206940: Quote marks in "alt" text break media attribute parsing - https://phabricator.wikimedia.org/T206940 [22:02:12] !log 
thcipriani@deploy1001 Synchronized wmf-config/CommonSettings.php: [[gerrit:472243|Allow Cloud VPS 172.16.0.0/16 for $wmgAllowLabsAnonEdits wikis]] T208986 (duration: 00m 54s) [22:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:15] T208986: WDQS tests can no longer edit test.wikidata.org - https://phabricator.wikimedia.org/T208986 [22:07:08] RECOVERY - IPsec on rdb2005 is OK: Strongswan OK - 2 ESP OK [22:07:08] RECOVERY - Check health of redis instance on 6378 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 6 keys, up 127 days 20 hours - replication_delay is 9 [22:07:24] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.3/extensions/LdapAuthentication/LdapAuthenticationPlugin.php: [[gerrit:472336|Expose methods used by OpenStackManager]] T208995 (duration: 00m 54s) [22:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:26] T208995: PHP Fatal Error: Call to private method LdapAuthenticationPlugin::bindAs() from context 'OpenStackNovaLdapConnection' - https://phabricator.wikimedia.org/T208995 [22:07:40] (03CR) 10jenkins-bot: Allow Cloud VPS 172.16.0.0/16 for $wmgAllowLabsAnonEdits wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472243 (https://phabricator.wikimedia.org/T208986) (owner: 10BryanDavis) [22:08:33] (03PS1) 10Thcipriani: Revert "labswiki rollback to 1.33.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472339 [22:08:58] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3088 keys, up 127 days 20 hours - replication_delay is 7 [22:09:29] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3178 keys, up 127 days 20 hours - replication_delay is 2 [22:10:49] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 
databases (db0) with 2888 keys, up 127 days 20 hours - replication_delay is 2 [22:13:41] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: Revert "labswiki rollback to 1.33.0-wmf.2" [22:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:49] (03CR) 10Thcipriani: [C: 032] Revert "labswiki rollback to 1.33.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472339 (owner: 10Thcipriani) [22:16:36] (03Merged) 10jenkins-bot: Revert "labswiki rollback to 1.33.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472339 (owner: 10Thcipriani) [22:22:15] (03CR) 10jenkins-bot: Revert "labswiki rollback to 1.33.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472339 (owner: 10Thcipriani) [22:23:26] (03CR) 10Effie Mouzeli: puppet:Reduce cronspam from modules/mediawiki/ (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/470877 (https://phabricator.wikimedia.org/T150375) (owner: 10Thifranc) [22:24:08] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.622 second response time [22:24:41] thcipriani: Let me know when you're done doing stuff to wmf.2/wmf.3, I'd like to deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/472340 so that I can safely add GrowthExperiments to extension-list [22:24:55] RoanKattouw: I'm all finished [22:24:56] (Not for deployment in prod yet, just beta, but extension-list isn't split between prod and beta) [22:24:58] OK cool thanks [22:27:00] After Roan I have an extension to drop from production (yay) if that's OK. 
[22:27:29] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:08] (03CR) 10Effie Mouzeli: [C: 032] Change rdb1005 to spare:system [puppet] - 10https://gerrit.wikimedia.org/r/472251 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [22:34:19] (03PS2) 10Effie Mouzeli: Change rdb1005 to spare:system [puppet] - 10https://gerrit.wikimedia.org/r/472251 (https://phabricator.wikimedia.org/T206450) [22:37:21] 10Operations, 10ops-codfw: unrack/decom cr1-eqord - https://phabricator.wikimedia.org/T208049 (10Papaul) [22:37:37] 10Operations, 10ops-codfw: unrack/decom cr1-eqord - https://phabricator.wikimedia.org/T208049 (10Papaul) [22:38:48] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10Papaul) [22:41:05] 10Operations, 10DBA: db2061 has predictive disk errors - https://phabricator.wikimedia.org/T208957 (10Papaul) p:05Triage>03Normal [22:41:36] (03PS2) 10Catrope: Add GrowthExperiments extension to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470885 (https://phabricator.wikimedia.org/T208449) [22:41:47] (03CR) 10Catrope: [C: 032] Add GrowthExperiments extension to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470885 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [22:42:58] (03Merged) 10jenkins-bot: Add GrowthExperiments extension to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470885 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [22:43:35] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ``` rdb1005.eqiad.wmnet ``` The log can be found in `/var/... 
[22:43:48] (03CR) 10Gehel: [C: 04-1] wdqs: separation of concerns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/471665 (https://phabricator.wikimedia.org/T208394) (owner: 10Mathew.onipe) [22:47:29] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.389 second response time [22:48:39] (03PS1) 10Bstorm: sonofgridengine: correct some issues for stretch bastions [puppet] - 10https://gerrit.wikimedia.org/r/472343 (https://phabricator.wikimedia.org/T200557) [22:50:08] (03CR) 10Bstorm: [C: 032] sonofgridengine: correct some issues for stretch bastions [puppet] - 10https://gerrit.wikimedia.org/r/472343 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:50:29] PROBLEM - IPsec on rdb2005 is CRITICAL: Strongswan CRITICAL - ok: 1 not-conn: rdb1005_v4 [22:50:36] (03CR) 10jenkins-bot: Add GrowthExperiments extension to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470885 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [22:53:19] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:39] PROBLEM - Check health of redis instance on 6378 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 604 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 6 keys, up 127 days 20 hours - replication_delay is 604 [22:53:48] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 609 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3178 keys, up 127 days 20 hours - replication_delay is 609 [22:53:59] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2888 keys, up 127 days 20 hours - replication_delay is 623 [22:54:28] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 645 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 
databases (db0) with 3088 keys, up 127 days 20 hours - replication_delay is 645 [22:57:53] !log catrope@deploy1001 Started scap: Full scap to rebuild i18n for the addition of the GrowthExperiments extension [22:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:35] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:03:34] those redis alerts are also due to reinstall of rdb1005 [23:14:01] RECOVERY - Long running screen/tmux on an-coord1001 is OK: OK: SCREEN detected but not long running. [23:16:16] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['rdb1005.eqiad.wmnet'] ``` and were **ALL** successful. [23:16:48] How long does a full scap take nowadays? [23:17:14] jiji: ^ success :) [23:19:35] :D [23:21:21] !log Disabled nagios checks on rdb1006 and rdb2005 due to rdb1005 reimaging - T206450 [23:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:25] T206450: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 [23:23:53] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6738.22 seconds Banyek T208954 [23:24:41] (03CR) 10Nuria: [C: 031] "Let's please document this here: https://wikitech.wikimedia.org/wiki/X-Analytics" [puppet] - 10https://gerrit.wikimedia.org/r/471257 (https://phabricator.wikimedia.org/T208795) (owner: 10Dr0ptp4kt) [23:32:21] PROBLEM - High CPU load on API appserver on mw2251 is CRITICAL: Return code of 255 is out of bounds [23:32:21] PROBLEM - High CPU load on API appserver on mw2262 is CRITICAL: Return code of 255 is out of bounds [23:32:21] PROBLEM - High CPU load on API appserver on mw2285 is CRITICAL: Return code of 255 is out of bounds [23:33:22] 
RECOVERY - High CPU load on API appserver on mw2251 is OK: OK - load average: 0.79, 4.77, 3.09 [23:33:22] RECOVERY - High CPU load on API appserver on mw2262 is OK: OK - load average: 0.62, 1.99, 1.16 [23:33:22] RECOVERY - High CPU load on API appserver on mw2285 is OK: OK - load average: 0.37, 1.62, 1.02 [23:34:23] James_F: full scap has been like 25 minutes [23:34:25] recently [23:34:41] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:37:11] thcipriani: Right. [23:37:33] !log catrope@deploy1001 Finished scap: Full scap to rebuild i18n for the addition of the GrowthExperiments extension (duration: 39m 40s) [23:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:40] or 40 :) [23:38:02] That was with i18n rebuild for both branches [23:38:23] (03CR) 10Catrope: [C: 032] GrowthExperiments, part I: Add extension flag to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470889 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:40:35] Oy. 
[23:41:29] (03PS2) 10Catrope: GrowthExperiments, part I: Add extension flag to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470889 (https://phabricator.wikimedia.org/T208449) [23:41:38] (03CR) 10Catrope: [C: 032] GrowthExperiments, part I: Add extension flag to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470889 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:42:44] (03Merged) 10jenkins-bot: GrowthExperiments, part I: Add extension flag to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470889 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:43:31] (03CR) 10jenkins-bot: GrowthExperiments, part I: Add extension flag to InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470889 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:44:14] (03PS3) 10Catrope: GrowthExperiments, part II: make extension flag operative in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470890 (https://phabricator.wikimedia.org/T208449) [23:44:19] (03CR) 10Catrope: [C: 032] GrowthExperiments, part II: make extension flag operative in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470890 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:44:55] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add flag for GrowthExperiments to InitialiseSettings (duration: 00m 53s) [23:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:32] (03Merged) 10jenkins-bot: GrowthExperiments, part II: make extension flag operative in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470890 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:46:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:48:26] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Make GrowthExperiments flag operative in CommonSettings (duration: 00m 53s) [23:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:27] (03PS3) 10Catrope: GrowthExperiments, part III: Enable on English and Korean beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470891 (https://phabricator.wikimedia.org/T208449) [23:49:31] (03CR) 10Catrope: [C: 032] GrowthExperiments, part III: Enable on English and Korean beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470891 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:49:43] James_F: OK I'm done messing with production, all yours now [23:50:14] (03CR) 10Jforrester: [C: 032] Drop the Petition extension: Part I - disable in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472213 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:50:36] (03Merged) 10jenkins-bot: GrowthExperiments, part III: Enable on English and Korean beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470891 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:54:26] (03PS1) 10Dzahn: icinga: fix path to retention.dat file on stretch [puppet] - 10https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) [23:55:27] (03CR) 10jenkins-bot: [V: 04-1] icinga: fix path to retention.dat file on stretch [puppet] - 10https://gerrit.wikimedia.org/r/472352 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:55:30] (03PS2) 10Jforrester: Drop the Petition extension: Part I - disable in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472213 (https://phabricator.wikimedia.org/T208081) [23:55:36] (03CR) 10Jforrester: [C: 032] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472213 
(https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:55:43] (03PS2) 10Jforrester: Drop the Petition extension: Part II - disable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472214 (https://phabricator.wikimedia.org/T208081) [23:55:49] (03CR) 10Jforrester: [C: 032] Drop the Petition extension: Part II - disable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472214 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:55:58] (03PS2) 10Jforrester: Drop the Petition extension: Part III - drop related user-rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472215 (https://phabricator.wikimedia.org/T208081) [23:56:04] (03CR) 10Jforrester: [C: 032] Drop the Petition extension: Part III - drop related user-rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472215 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:56:22] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.300 second response time [23:57:08] (03Merged) 10jenkins-bot: Drop the Petition extension: Part I - disable in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472213 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:57:34] (03Merged) 10jenkins-bot: Drop the Petition extension: Part II - disable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472214 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:57:40] (03Merged) 10jenkins-bot: Drop the Petition extension: Part III - drop related user-rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472215 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:58:08] (03CR) 10jenkins-bot: GrowthExperiments, part II: make extension flag operative in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470890 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:58:10] (03CR) 10jenkins-bot: 
GrowthExperiments, part III: Enable on English and Korean beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470891 (https://phabricator.wikimedia.org/T208449) (owner: 10Catrope) [23:58:12] (03CR) 10jenkins-bot: Drop the Petition extension: Part I - disable in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472213 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:58:14] (03CR) 10jenkins-bot: Drop the Petition extension: Part II - disable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472214 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:58:36] (03CR) 10jenkins-bot: Drop the Petition extension: Part III - drop related user-rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472215 (https://phabricator.wikimedia.org/T208081) (owner: 10Jforrester) [23:59:52] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:59:52] I'll SWAT, given I'm doing it already.