[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170824T0000). [00:23:29] (03PS1) 10Smalyshev: Enable access to arbitrary namespaces for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/373404 (https://phabricator.wikimedia.org/T157676) [00:56:54] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3547547 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labstore2001.codfw.wmnet'] ``` Of which those **FAI... [01:04:13] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [01:04:43] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1503536670 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 5201118 keys, up 4 minutes 27 seconds - replication_delay is 1503536670 [01:04:53] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1503536683 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5194878 keys, up 4 minutes 40 seconds - replication_delay is 1503536683 [01:05:13] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9792820 keys, up 5 minutes 5 seconds - replication_delay is 0 [01:05:14] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:05:43] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 5193506 keys, up 5 minutes 35 seconds - replication_delay is 0 [01:06:04] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 5193153 keys, up 5 minutes 57 seconds - replication_delay is 0 [01:06:43] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 5195161 keys, up 6 minutes 31 seconds - replication_delay is 0 [01:28:33] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3547580 (10aaron) From ``` mwscript maintenance/runJobs.php wikidatawiki --type htmlCacheUpdate --nothrottle --maxjobs 100 | grep "IsSelf=1" ``` I can... [01:29:42] Krinkle: hmm, didn't notice that https://gerrit.wikimedia.org/r/#/c/373390/ wasn't deployed [01:46:36] !log aaron@tin Synchronized php-1.30.0-wmf.15/includes/jobqueue: Deploy 752580637 (T173710) (duration: 00m 49s) [01:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:53] T173710: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710 [01:57:30] always so hard to judge if those recursive jobs are clearing out, because running them creates more jobs. But of course we don't even know how many jobs they will turn into ... so not something easily tracked [01:58:19] I can see more superseded jobs as I'd expect from my runs, so that's good [01:58:48] but yeah, it will take a while to get down with all the recursion [02:01:52] * AaronSchulz is stuck in the office monitoring, with soda and cheetos ;) [02:02:02] well, I burned through the latter... [02:05:53] fun! 
[02:28:24] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.14) (duration: 09m 50s) [02:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:50] * AaronSchulz is off [02:43:36] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.15) (duration: 05m 56s) [02:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:37] (03PS1) 10Andrew Bogott: labs puppetmasters: allow horizon access via ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/373421 (https://phabricator.wikimedia.org/T173982) [02:50:15] (03CR) 10Andrew Bogott: [C: 032] labs puppetmasters: allow horizon access via ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/373421 (https://phabricator.wikimedia.org/T173982) (owner: 10Andrew Bogott) [02:50:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Aug 24 02:50:40 UTC 2017 (duration 7m 4s) [02:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:53:43] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:01:13] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:01:40] (03PS1) 10Andrew Bogott: add ipv6 record for californium [dns] - 10https://gerrit.wikimedia.org/r/373422 [03:02:35] (03CR) 10Andrew Bogott: [C: 032] add ipv6 record for californium [dns] - 10https://gerrit.wikimedia.org/r/373422 (owner: 10Andrew Bogott) [03:04:44] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [03:05:13] RECOVERY - puppet last run on labpuppetmaster1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [03:26:23] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 882.09 seconds [04:05:43] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 211.86 seconds [05:26:28] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3547688 (10Marostegui) >>! In T173859#3545916, @Steinsplitter wrote: >>>! In T173859#3544614, @Marostegui wrote: >> @MarcoAurelio See: T172207#3544611 >>... [05:26:45] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3547689 (10Marostegui) 05stalled>03Open [05:44:09] (03CR) 10Marostegui: "What about prometheus files to add db1069 to the list of monitored hosts? Or you will do that in a separate commit?" [puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [06:02:19] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3547707 (10Marostegui) >>! In T173365#3545014, @Cmjohnson wrote: > @Marostegui The ssd has been replaced. Please resolve after rebuild Should we close this ticket and create a new one f... 
[06:03:46] (03PS1) 10Marostegui: s7.hosts: Remove db1041 [software] - 10https://gerrit.wikimedia.org/r/373434 (https://phabricator.wikimedia.org/T173915) [06:06:14] (03CR) 10Marostegui: [C: 032] s7.hosts: Remove db1041 [software] - 10https://gerrit.wikimedia.org/r/373434 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [06:06:47] (03Merged) 10jenkins-bot: s7.hosts: Remove db1041 [software] - 10https://gerrit.wikimedia.org/r/373434 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [06:07:36] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3547710 (10Marostegui) a:03Cmjohnson This host is now ready to be decommissioned and ready for @Cmjohnson do the DC-Ops part [06:37:54] (03PS4) 10Smalyshev: Add RDF dumps for categories [puppet] - 10https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) [07:24:56] !log starting the run for rebuildTermIndex (T171460) [07:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:12] T171460: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460 [07:47:00] (03PS2) 10Volans: Upstream release 1.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/373381 [07:49:19] (03CR) 10Volans: [C: 032] Upstream release 1.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/373381 (owner: 10Volans) [07:51:17] (03Merged) 10jenkins-bot: Upstream release 1.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/373381 (owner: 10Volans) [07:59:29] (03CR) 10Jcrespo: "The dhcp is already correct, the yaml and prometheus monitoring is missing." 
[puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [08:00:31] (03PS3) 10Filippo Giunchedi: statsite: don't track statsd client traffic [puppet] - 10https://gerrit.wikimedia.org/r/373253 (https://phabricator.wikimedia.org/T173731) [08:00:33] (03PS3) 10Filippo Giunchedi: swift: don't track connections to swift backend services on frontend machines [puppet] - 10https://gerrit.wikimedia.org/r/373039 (https://phabricator.wikimedia.org/T173731) [08:01:54] (03CR) 10Filippo Giunchedi: [C: 032] statsite: don't track statsd client traffic [puppet] - 10https://gerrit.wikimedia.org/r/373253 (https://phabricator.wikimedia.org/T173731) (owner: 10Filippo Giunchedi) [08:03:33] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [08:03:54] (03PS2) 10Jcrespo: mariadb: reimage db1069 as a core host, remove sanitarium old role [puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) [08:04:33] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.043 second response time [08:08:34] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3547826 (10Ladsgroup) The jobqueue has slowed down but still increasing, and cirrusSearchIncomingLinkCount still increases the jobqueue with rate of 100... [08:12:36] (03PS2) 10Jcrespo: mariadb: Repool db1078 with full weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373363 (https://phabricator.wikimedia.org/T173365) [08:14:36] 10Operations, 10monitoring, 10Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3547831 (10akosiaris) @Dzhan, that's a fair question. On my part I see the following. 
There's definitely the benefit of getting hardware monitoring. Both at the host level (RAI... [08:29:57] (03PS8) 10Ladsgroup: mediawiki: Add puppetized cronjob for rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/370626 (https://phabricator.wikimedia.org/T171460) [08:30:38] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1078 with full weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373363 (https://phabricator.wikimedia.org/T173365) (owner: 10Jcrespo) [08:30:55] (03CR) 10jenkins-bot: mariadb: Repool db1078 with full weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373363 (https://phabricator.wikimedia.org/T173365) (owner: 10Jcrespo) [08:42:10] (03PS9) 10Jcrespo: mediawiki: Add puppetized cronjob for rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/370626 (https://phabricator.wikimedia.org/T171460) (owner: 10Ladsgroup) [08:42:15] (03CR) 10Jcrespo: [C: 031] mediawiki: Add puppetized cronjob for rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/370626 (https://phabricator.wikimedia.org/T171460) (owner: 10Ladsgroup) [08:44:46] (03CR) 10Jcrespo: [C: 032] mediawiki: Add puppetized cronjob for rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/370626 (https://phabricator.wikimedia.org/T171460) (owner: 10Ladsgroup) [08:52:10] !log Drop tables article_assessment, article_assessment_pages, article_assessment_ratings tables from testwiki - T173590 [08:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:40] !log 16 minute test run of rebuildTermSqlIndex.php on terbium [08:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:44] (03CR) 10Marostegui: [C: 031] mariadb: reimage db1069 as a core host, remove sanitarium old role [puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [09:04:24] (03PS3) 10Jcrespo: mariadb: reimage db1069 as a 
core host, remove sanitarium old role [puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) [09:04:26] (03PS1) 10Jcrespo: wikidata-maintenance: Emergency stop of rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/373507 (https://phabricator.wikimedia.org/T171460) [09:04:39] (03PS2) 10Jcrespo: wikidata-maintenance: Emergency stop of rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/373507 (https://phabricator.wikimedia.org/T171460) [09:05:02] (03CR) 10jerkins-bot: [V: 04-1] wikidata-maintenance: Emergency stop of rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/373507 (https://phabricator.wikimedia.org/T171460) (owner: 10Jcrespo) [09:06:20] (03CR) 10Ladsgroup: wikidata-maintenance: Emergency stop of rebuildTermSqlIndex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/373507 (https://phabricator.wikimedia.org/T171460) (owner: 10Jcrespo) [09:06:22] (03PS3) 10Jcrespo: wikidata-maintenance: Emergency stop of rebuildTermSqlIndex [puppet] - 10https://gerrit.wikimedia.org/r/373507 (https://phabricator.wikimedia.org/T171460) [09:08:17] (03CR) 10Jcrespo: "See https://gerrit.wikimedia.org/r/373309" [puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [09:09:14] (03CR) 10Jcrespo: "In theory absent is a keyword, or a instance of a type, so that should be ok." [puppet] - 10https://gerrit.wikimedia.org/r/373507 (https://phabricator.wikimedia.org/T171460) (owner: 10Jcrespo) [09:19:17] I think there is a need to spin up one more jobrunner: T173710 [09:19:18] T173710: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710 [09:26:53] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:04] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:27:13] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:13] PROBLEM - puppet last run on mw1287 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:13] PROBLEM - puppet last run on mw1288 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:14] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:23] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:23] PROBLEM - puppet last run on mc1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:23] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:24] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:24] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:33] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:33] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:43] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:43] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:27:43] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:44] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:27:53] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:13] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:14] PROBLEM - puppet last run on mw1280 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:28:42] 10Operations, 10monitoring, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#3547992 (10fgiunchedi) [09:29:03] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:29:11] puppetdb seems to have issues, looking [09:32:35] (03PS1) 10Gehel: elasticsearch - switch to using logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373509 [09:33:06] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - switch to using logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373509 (owner: 10Gehel) [09:33:14] puppetdb got killed by OOM killer, and restarted by systemd [09:33:29] (03PS2) 10Gehel: elasticsearch - switch to using logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373509 [09:33:33] !log executing https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed to get rid of failed ones [09:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:02] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - switch to using logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373509 (owner: 10Gehel) [09:34:43] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:34:44] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:35:23] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:35:23] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:35:24] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:35:33] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:35:33] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:35:44] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures 
[09:35:56] (03PS3) 10Gehel: elasticsearch - switch to using logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373509 [09:36:23] RECOVERY - puppet last run on mw1280 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:36:23] RECOVERY - puppet last run on mc1027 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:36:24] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:36:33] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:36:43] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [09:36:54] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:37:13] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:37:13] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:37:33] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:37:54] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:38:03] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:38:14] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [09:39:23] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:39:45] (03PS1) 10Gehel: apertium - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373510 [09:43:14] (03PS1) 10Volans: Fix installation of example files [software/cumin] 
(debian) - 10https://gerrit.wikimedia.org/r/373513 (https://phabricator.wikimedia.org/T174008) [09:47:17] (03CR) 10Volans: [C: 032] Fix installation of example files [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/373513 (https://phabricator.wikimedia.org/T174008) (owner: 10Volans) [09:49:02] (03Merged) 10jenkins-bot: Fix installation of example files [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/373513 (https://phabricator.wikimedia.org/T174008) (owner: 10Volans) [09:56:01] !log jynus@tin Synchronized wmf-config/db-eqiad.php: repool db1078 with full weight (duration: 00m 46s) [09:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:23] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:06:37] (03PS1) 10Gehel: base - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373515 [10:06:39] (03PS1) 10Gehel: camus - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373516 [10:06:41] (03PS1) 10Gehel: confd - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373517 [10:06:43] (03PS1) 10Gehel: dumps - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373518 [10:33:13] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1483.75 seconds [10:33:43] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.39 seconds [10:36:07] (03Draft1) 10Paladox: Gerrit: Set autoMigrate so that changes are stored in both reviewdb and notedb [puppet] - 10https://gerrit.wikimedia.org/r/373520 [10:36:09] (03PS2) 10Paladox: Gerrit: Set autoMigrate so that changes are stored in both reviewdb and notedb [puppet] - 10https://gerrit.wikimedia.org/r/373520 [10:37:46] (03PS3) 10Paladox: Gerrit: Set autoMigrate so that changes are stored in both reviewdb and notedb [puppet] - 
10https://gerrit.wikimedia.org/r/373520 [10:44:15] (03PS4) 10Paladox: Gerrit: Set autoMigrate so that changes are stored in both reviewdb and notedb [puppet] - 10https://gerrit.wikimedia.org/r/373520 [10:45:04] (03CR) 10Paladox: "I created this as i forgot where the doc was for this. But this will make everything safe when upstream remove reviewdb support. Also will" [puppet] - 10https://gerrit.wikimedia.org/r/373520 (owner: 10Paladox) [10:47:52] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3548080 (10Ladsgroup) I take that back, I ran runJobs on terbium to see what's going on there and most jobs gets passed easily (including cirrusSearchInc... [10:49:28] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3548088 (10Mvolz) Shall we re-enable the other DC now? [10:59:47] * Steinsplitter eyes marostegui [11:02:58] Steinsplitter: Hi!! [11:03:04] Steinsplitter: Give me a sec to get ready [11:03:09] ok :) [11:03:16] (I was in a meeting) [11:03:24] (03CR) 10Alexandros Kosiaris: [C: 032] "I 've tested this as well in a VM, looks like it's working quite fine. 
Tested with the files in hand (for bounces at least) and the result" [puppet] - 10https://gerrit.wikimedia.org/r/372848 (https://phabricator.wikimedia.org/T173733) (owner: 10Alexandros Kosiaris) [11:03:28] (03PS3) 10Alexandros Kosiaris: mail::mx: Ship bounce/warn message files [puppet] - 10https://gerrit.wikimedia.org/r/372848 (https://phabricator.wikimedia.org/T173733) [11:03:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] mail::mx: Ship bounce/warn message files [puppet] - 10https://gerrit.wikimedia.org/r/372848 (https://phabricator.wikimedia.org/T173733) (owner: 10Alexandros Kosiaris) [11:03:43] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:03:54] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:03:54] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:04:00] :P [11:04:41] maybe those are the backups running, actually [11:04:55] 10Operations, 10monitoring, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#3548110 (10fgiunchedi) [11:05:37] Steinsplitter: You can now proceed, can you send me the meta URL to follow the progress of the rename once you start it? [11:07:23] : ok, thanks. See https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Karl-Heinz_Jansen [11:07:40] Thanks! [11:07:59] !log Global rename Papa1234 → Karl-Heinz Jansen - T173859 [11:08:01] Steinsplitter: meow, big rename going on? 
[11:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:13] T173859: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859 [11:08:19] TabbyCat: yepp [11:08:41] 100k edits at dewiki, that'll be fun :P [11:10:04] 10Operations, 10Mail, 10OTRS, 10Patch-For-Review: Automatically merge bounces/DSNs in ticket - https://phabricator.wikimedia.org/T173733#3548114 (10akosiaris) 05Open>03stalled Above patch has been merged, puppet has run and exim reloaded. Setting it to stalled to reflect I 'll be monitoring the situat... [11:12:28] 10Operations, 10monitoring, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#3548119 (10fgiunchedi) Of the plan above the next hairy step is gathering the list of physical disks. For directly attached disks it is easy since block devices presented by linux a... [11:13:47] marostegui: dewiki db is lagging high :) [11:13:57] yep [11:13:58] I am seeing it [11:14:07] db1045 is now recovered [11:14:13] db1026 still there [11:15:44] it is now decreasing [11:15:48] and gone :) [11:18:54] O_O [11:19:34] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1946 bytes in 0.107 second response time [11:21:12] marostegui: maybe we can provision 2 hosts with copies of db1045 and db1026 [11:21:32] as s8 content in the future will be very similar [11:22:33] Yeah, maybe it is time to do that [11:22:58] I will do db1069 for s7, if you are ok with that [11:23:05] this afternoon [11:23:09] sure [11:24:54] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:03] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:25:03] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:28:10] 10Operations, 10Mail, 10OTRS, 10Patch-For-Review: Automatically merge bounces/DSNs in ticket - https://phabricator.wikimedia.org/T173733#3538151 (10Thibaut120094) [11:29:08] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Redirect nan.wikipedia.org to zh-min-nan.wikipedia.org - https://phabricator.wikimedia.org/T173966#3548137 (10Aklapper) p:05Normal>03Triage What is "it" in "it has been the default behavior in the last several months"? [[ https://www.med... [11:29:28] Steinsplitter: We are done, no? [11:29:34] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1958 bytes in 0.132 second response time [11:30:15] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3548140 (10Steinsplitter) 05Open>03Resolved done: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Karl-Heinz_Jansen Thanks @Marostegui [11:30:18] sepp :) [11:30:23] *y [11:32:18] Thanks! [11:36:07] !log disable puppet on db1069 in preparation for reimage [11:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:11] jynus: Any objection if I take db110 to replace db1026? https://phabricator.wikimedia.org/T172679 [11:37:46] db1100 [11:38:07] no, although mark it as future s8 on a comment on puppet or something [11:38:12] yep! [11:38:23] we need to reserve 4 of those at least [11:38:27] yeah [11:38:37] have a look at physical position [11:38:47] of the remaining ones, if it makes sense [11:39:02] so we do not leave the last 4 on the same rack [11:39:29] actually, can you use db1099 ? 
[11:39:30] hehe yeah [11:39:34] oh let me check [11:39:56] so we stop referring to it as "old pc host" [11:40:02] sure, I can use that one [11:40:07] let me check the rack position of the others [11:40:14] if you do not mind reimaging [11:40:21] anything will work [11:40:29] so do not take my suggestion for granted [11:40:37] No no, I don't mind which one [11:40:52] db1096 is in use? [11:41:15] let me delete those comments, they are confusing [11:41:45] I was going to ask you, as you were using it :) [11:41:47] So not sure [11:42:06] I think that is clearer: https://phabricator.wikimedia.org/T172679 [11:42:13] knowing that, use anyone you want [11:42:45] I will take db1096 [11:43:00] I would take a couple and pool them on s5 [11:43:11] yep [11:43:12] for later s8 assignment [11:43:14] That is the plan :) [11:43:16] good [11:43:49] I think 1 master + 3 slaves + multisource rcs should be enough [11:43:55] if they are all new [11:44:48] yeah, that looks about right [11:44:57] maybe something with vslow [11:45:45] dbstore2001 is lagging because I am running the dump script [11:46:16] but what is the issue on dbstore1001? 
[11:46:47] there is one dump process running [11:46:56] interesting [11:48:41] (03PS7) 10Jcrespo: mariadb: Adding rack allocations, some formatting fixes, read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459) [11:48:43] (03PS1) 10Jcrespo: mariadb: Repool es2013 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373524 (https://phabricator.wikimedia.org/T172265) [11:49:06] (03PS1) 10Marostegui: mariadb: Add db1096 as s5 slave [puppet] - 10https://gerrit.wikimedia.org/r/373525 (https://phabricator.wikimedia.org/T172679) [11:50:33] (03PS2) 10Marostegui: mariadb: Add db1096 as s5 slave [puppet] - 10https://gerrit.wikimedia.org/r/373525 (https://phabricator.wikimedia.org/T172679) [11:53:24] I remember, I used db1096 to test the multisource for a while [11:53:32] that is why it is on stretch [11:53:58] hehe yeah I was surprised it was on stretch [11:55:09] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7592/" [puppet] - 10https://gerrit.wikimedia.org/r/373525 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [11:56:09] (03CR) 10Marostegui: [C: 032] mariadb: Add db1096 as s5 slave [puppet] - 10https://gerrit.wikimedia.org/r/373525 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [11:59:47] 10Operations, 10monitoring, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#3548171 (10Volans) The above should not be needed on megaraid hosts where `smartctl --scan-open` works well AFAICT (see on `ms-be2014`). 
[12:02:13] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.77 seconds [12:04:33] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 37.69 seconds [12:15:30] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3537269 (10daniel) >>! In T173710#3545392, @Esc3300 wrote: > Are these originating also in clients or initially coming from Wikidata? What triggers them?... [12:16:27] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3548216 (10daniel) >>! In T173710#3542688, @aaron wrote: > Mostly htmlCacheUpdate jobs on wikidatawiki: > > htmlCacheUpdate: 6014947 queued; 5 claimed (... [12:24:03] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3548220 (10daniel) So, @Ladsgroup told me that he observed HtmlCacheUpdate jobs for 100 pages taking more than one minute. Given that the purging process... [12:24:21] (03PS1) 10Marostegui: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373528 (https://phabricator.wikimedia.org/T172679) [12:26:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373528 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [12:26:31] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3548223 (10daniel) >>! In T173710#3547580, @aaron wrote: > In other words, base jobs for entities that will divide up and purge all backlinks to the give... 
[12:27:32] (03PS1) 10Marostegui: s5.hosts: Add db1096 [software] - 10https://gerrit.wikimedia.org/r/373529 (https://phabricator.wikimedia.org/T172679) [12:27:39] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373530 (https://phabricator.wikimedia.org/T128546) [12:27:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373528 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [12:27:51] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1026 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373528 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [12:29:02] (03CR) 10Marostegui: [C: 032] s5.hosts: Add db1096 [software] - 10https://gerrit.wikimedia.org/r/373529 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [12:29:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1026 to clone db1096 from it T172679 (duration: 00m 47s) [12:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:42] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [12:29:45] (03Merged) 10jenkins-bot: s5.hosts: Add db1096 [software] - 10https://gerrit.wikimedia.org/r/373529 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [12:30:26] !log Stop MySQL on db1026 to clone db1096 from it - T172679 [12:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:44] (03PS1) 10Phuedx: pagePreviews: Re-enable on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373531 (https://phabricator.wikimedia.org/T171853) [12:35:15] (03PS2) 10Phuedx: pagePreviews: Re-enable on enwiki and dewiki (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373531 (https://phabricator.wikimedia.org/T171853) [12:41:17] (03PS1) 10Marostegui: mariadb: Remove yaml files from db1015 and db1041 [puppet] - 
10https://gerrit.wikimedia.org/r/373534 (https://phabricator.wikimedia.org/T173915) [12:45:39] !log killing discovery report updater on stats1005 (stuck since Aug 15) - T173333 [12:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:52] T173333: Reportupdater outputs files with restricted permissions - https://phabricator.wikimedia.org/T173333 [12:45:57] gehel: come on, I am sure it was about to finish :p [12:46:30] marostegui: :) I'm sure your optimism works in some cases :) [12:46:36] xddddd [12:46:48] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7593/" [puppet] - 10https://gerrit.wikimedia.org/r/373534 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [12:47:40] (03CR) 10Marostegui: [C: 032] mariadb: Remove yaml files from db1015 and db1041 [puppet] - 10https://gerrit.wikimedia.org/r/373534 (https://phabricator.wikimedia.org/T173915) (owner: 10Marostegui) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170824T1300). [13:00:04] jan_drewniak and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:48] i'm here [13:01:01] i could deploy but i honestly can't remember how to do the portal deploys [13:01:23] phuedx: there are docs, let me find them for you :) [13:01:39] zeljkof: was just looking on wikitech [13:01:43] LET'S RACE [13:01:47] o/ [13:01:49] https://www.mediawiki.org/wiki/Wikipedia.org_Portal#Portal_Deployment [13:02:08] phuedx: the link is in the deployment calendar :D [13:02:12] oh mibad [13:02:15] the link's in the calendar [13:02:17] derp [13:02:34] phuedx: want to do SWAT today? since it's your commit and portals? 
[13:02:39] (I can SWAT if you prefer) [13:02:40] zeljkof: sure [13:02:49] my commit is beta cluster only [13:02:52] ok, ping me if you need me, I'm around [13:02:52] so it's low impact [13:02:59] <3 [13:03:04] do i have to log that i'm deploying? [13:03:25] (03CR) 10Phuedx: [C: 032] "SWAT!!1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373530 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:03:26] you can just leave a comment here, I just usually say "I can SWAT today" [13:03:40] i'm swatting today y'all [13:03:46] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#General_Advice [13:03:48] HOLD ON TO YOUR HATS [13:04:02] (especially your rare tf2 hats) [13:04:24] I can SWAT today! [13:04:45] 👍 [13:04:56] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373530 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:06:07] jan_drewniak: can portals deployments be tested on the mwdebugs [13:06:17] or should i just push and you give me urls to purge (if necessary) [13:06:18] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373530 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:07:28] phuedx: It can be tested on the debugs (though I'm not sure how that works) but urls will be purged automatically with the sync-portals script [13:08:19] jan_drewniak: has the previous sop been skipping pulling onto the mwdebugs and sync-portals straight away? 
[13:08:30] looks like it from the instructions [13:10:16] phuedx: that might have been the case :P [13:10:31] i see [13:10:48] ok, well if you're happy, then i'll run the sync-portals script [13:10:57] there don't seem to be instructions about testing it on mwdebugs [13:11:04] ^ jan_drewniak [13:11:25] (03CR) 10Phuedx: [C: 032] "SWAT!!1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373531 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [13:11:35] phuedx: yeah these are pretty non-eventful updates [13:11:42] ok, [13:11:44] running the script [13:12:40] !log phuedx@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 49s) [13:12:48] (03Merged) 10jenkins-bot: pagePreviews: Re-enable on enwiki and dewiki (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373531 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [13:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:57] PROBLEM - HHVM rendering on mw2245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:01] (03CR) 10jenkins-bot: pagePreviews: Re-enable on enwiki and dewiki (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373531 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [13:13:29] !log phuedx@tin Synchronized portals: (no justification provided) (duration: 00m 49s) [13:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:48] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 75559 bytes in 1.557 second response time [13:13:52] phuedx: yup! that worked :) [13:14:02] jan_drewniak: done and urls reported as purged [13:14:12] the (no justification provided) was mibad, i didn't realise there was another argument for the script [13:14:17] should've linked it to the task [13:14:19] cool [13:14:51] ok, on to 373531 [13:15:01] phuedx: thanks! 
[13:15:26] I have this for SWAT: https://gerrit.wikimedia.org/r/373539 if there is some time [13:15:31] not testable [13:15:35] jan_drewniak: no worries [13:15:49] Amir1: i'm just testing a small beta cluster only change [13:15:54] then i'll sync it [13:15:59] then the queue is drained [13:16:01] thanks. It can wait [13:16:02] so i reckon there's time [13:16:37] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3548361 (10akosiaris) >>! In T165105#3537789, @akosiaris wrote: > I think we have been unblocked btw > > `... [13:17:31] ok, https://gerrit.wikimedia.org/r/#/c/373531/ lgtm on the beta cluster [13:17:32] syncing [13:19:21] !log phuedx@tin Synchronized wmf-config/InitialiseSettings-labs.php: T171853: Re-enable Page Previews for enwiki and dewiki on the Beta Cluster (duration: 00m 47s) [13:19:26] Amir1, zeljkof: it's been a while since i deployed an extension change [13:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:33] T171853: Create A/B test strategy for en- and dewiki tests - https://phabricator.wikimedia.org/T171853 [13:19:36] is the submodule bump done automatically now? [13:20:32] phuedx: no [13:20:39] I've done it recently [13:20:48] 10Operations, 10Ops-Access-Requests, 10Release-Engineering-Team (Watching / External), 10User-Addshore: Make @daniel a MediaWiki deployer - https://phabricator.wikimedia.org/T173230#3548373 (10daniel) Thank you! [13:21:38] I'll be back soon [13:21:58] Amir1: i'll hold off, ping me when you're back [13:25:59] I'm back [13:26:03] phuedx: ^ [13:29:18] Amir1: does anything else need to happen for the wikidata change? i'm aware that wikidata has a build process and might need other steps? [13:29:28] *. [13:29:38] phuedx: I changed the build so nothing needs to be done [13:30:38] Amir1: cool. 
once that's merged, i'll update mwdebug1002 [13:30:45] wait [13:30:48] you said no testing [13:30:52] eah [13:30:54] *yeah [13:31:02] because it's for the jobrunner [13:34:56] (03PS1) 10Phuedx: pagePreviews: Bump on/control group size to 25% (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373541 (https://phabricator.wikimedia.org/T171853) [13:35:52] * phuedx twiddles thumbs [13:39:16] phuedx: I guess it's merged now :) [13:39:32] Amir1: so now i have to bump the submodule right? [13:39:40] i coulda sworn someone said that this was done automatically now [13:40:04] you need to do "git submodule update" after git fetch [13:42:31] git fetch; git submodule update extensions/Wikidata ain't doing anything currently [13:44:13] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Merging_and_Fetching_Patches [13:44:20] Doesn't this work? let me check [13:46:57] zeljkof: any advice? [13:47:23] merged a change in an extension, not seeing the submodule update on the deployment server [13:47:45] phuedx: looking... [13:48:41] phuedx: which commit did you merge? [13:48:59] zeljkof: https://gerrit.wikimedia.org/r/#/c/373539/ [13:51:09] phuedx: I know there is something special about deploying wikidata, not sure I have ever done it [13:51:57] maybe small patches are the same as usual, this is the docs https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_the_deployment_host [13:53:55] Amir1: any thoughts? [13:54:16] I don't think this patch is any different from any other extension [13:54:34] phuedx: my notes say to do this [13:54:37] (03PS1) 10Ottomata: Add -f to rm flag when deleting old druid request logs [puppet] - 10https://gerrit.wikimedia.org/r/373542 [13:54:43] zfilipin@tin:~$ cd /srv/mediawiki-staging/php-1.30.0-wmf.15/extensions/Wikidata [13:54:48] git fet h [13:54:54] sorry: git fetch [13:55:07] git log HEAD..origin/wmf/1.30.0-wmf.15 [13:55:09] did you do that? 
[13:55:36] (03CR) 10Ottomata: [C: 032] Add -f to rm flag when deleting old druid request logs [puppet] - 10https://gerrit.wikimedia.org/r/373542 (owner: 10Ottomata) [13:57:08] zeljkof: yeah and i don't see any change [13:57:31] hm, strange [13:57:42] wait, Amir1 has that change been added to the build already? [13:57:49] I rarely deploy extensions, so I don't have much experience :| [13:57:51] no [13:57:51] it's branch master but topic wmf/1.30 [13:58:11] wtf [13:58:17] oh, wait, that's in master!? [13:58:19] that's my fault [13:58:47] Amir1: well three of us misread it too :) [13:59:00] yes, the topic name confused me too :) [13:59:07] zeljkof, are you swatting still please? [13:59:09] I need two scripts to be run [13:59:20] so, it has to be merged in master, then picked in a branch [13:59:22] zeljkof: do you know why when I did "checkout wmf..." it changed topic and not branch [13:59:28] Urbanecm: phuedx is swatting :) [13:59:44] Amir1: maybe you had a topic branch with that name [13:59:59] Ok. phuedx, can you run a script for me please? [14:00:05] yeah, people made it when they build the new branch [14:00:06] Urbanecm: the window is still open [14:00:11] what can i do for you? [14:00:21] I wish T173994 to be resolved. [14:00:23] T173994: Run refreshLinks.php then updateArticleCount.php on Bashkir Wikibooks - https://phabricator.wikimedia.org/T173994 [14:00:29] Just run the two scripts in the task :) [14:00:30] Urbanecm: i'll see if i can! 
i'm not sure what it entails [14:00:30] (03PS1) 10Marostegui: db-eqiad.php: Add db1096 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373543 (https://phabricator.wikimedia.org/T172679) [14:00:59] i'm also going to deploy a minor bc-only change now too [14:01:02] phuedx: usually, it's just copy/paste of the command https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Maintenance_Scripts [14:01:20] (03CR) 10Phuedx: [C: 032] "SWAT!!1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373541 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [14:01:24] phuedx, I can tell you exact commands which you should run at terbium/tin. [14:01:39] Urbanecm: cool, i'm on tin right now [14:02:22] phuedx, ok. I'll put the syntax in the task [14:02:35] zeljkof: what do you suggest on how to proceed? [14:02:42] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2024 not powering on - https://phabricator.wikimedia.org/T171275#3459460 (10faidon) The reported (by dmidecode etc.) serial number for the system changed from MXQ62300TQ to HZ6BNV8315. I changed Racktables to reflect that. I'm not sure what's our policy supp... [14:02:43] Amir1: do you need more time? [14:02:56] It's there [14:03:05] yeah a little [14:03:05] (03Merged) 10jenkins-bot: pagePreviews: Bump on/control group size to 25% (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373541 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [14:03:18] (03CR) 10jenkins-bot: pagePreviews: Bump on/control group size to 25% (BC-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373541 (https://phabricator.wikimedia.org/T171853) (owner: 10Phuedx) [14:03:20] To see how I can make a patch in the branch and not on the topic [14:04:09] https://www.irccloud.com/pastebin/30tQwmAj/ [14:04:48] Urbanecm: sorry, which task? [14:04:52] Amir1: is the patch urgent? 
since the swat window is over [14:04:55] T173994 [14:05:07] phuedx, ^^^ [14:05:14] yeah, the jobqueue is more than 10.5M jobs now [14:05:24] Amir1: if not, take your time and cherry-pick the commit to a branch, merge and deploy [14:05:44] Amir1: oh, so if it's urgent, cherry-pick and deploy immediately :) [14:06:07] you can cherry pick from gerrit web interface as far as I remember, so there is little chance for mistake [14:06:13] okay I will take over and do it [14:06:27] Urbanecm: oh mibad [14:06:31] i missed that [14:07:04] Amir1: I'm around, but I don't have much experience with extensions, especially wikidata :| [14:07:12] but looks like you know what to do [14:07:35] phuedx, is there anything else you must know about the two scripts? [14:07:47] Urbanecm: nope, your task explains it [14:07:50] okay [14:07:52] Thanks [14:07:56] zeljkof: is it alright if i run the scripts [14:07:59] we're a little outside the window? [14:09:35] From my POV it is alright, the wiki is very small and it should take only a second [14:10:37] zeljkof: ^ [14:10:39] zeljkof: ^ [14:11:23] (03PS1) 10Rush: openstack: for glance use ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/373545 (https://phabricator.wikimedia.org/T171494) [14:11:42] alright [14:12:06] phuedx: sure [14:12:20] ok [14:12:29] if there is no other deployment going on, and if the script is fast, go ahead [14:13:35] Urbanecm: running the first script (refreshLinks) [14:13:44] Ack [14:13:59] !log refreshLinks on bawiki (T173994) [14:13:59] !log updated cumin package in our APT to version cumin_1.0.0-1 [14:14:02] Hi, is swat done? 
[14:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:25] marostegui: i believe it's still open for Amir1 so that he can deploy a hotfix [14:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:26] T173994: Run refreshLinks.php then updateArticleCount.php on Bashkir Wikibooks - https://phabricator.wikimedia.org/T173994 [14:14:34] phuedx: ok! [14:14:57] marostegui: I'm doing a rather unscheduled deploy [14:15:01] (03PS2) 10Rush: openstack: for glance use ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/373545 (https://phabricator.wikimedia.org/T171494) [14:15:07] it will be done hopefully by ten minutes [14:15:19] sure, no worries [14:15:49] (03CR) 10Rush: [C: 032] openstack: for glance use ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/373545 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:18:40] phuedx, in the task I can see "refreshLinks on bawiki". Just want to remind you that this is about bawikibooks, not bawiki. [14:19:21] sorry Urbanecm, i'll correct myself on the task [14:19:25] i did type "bawikibooks" [14:20:05] for the script [14:20:55] phuedx, Sorry, some internet troubles. What did you say before? [14:21:23] i did type "bawikibooks" for the script invocation [14:21:37] but i'll correct myself on the task re: the log message [14:21:44] refreshlinks is almost done [14:21:47] Oh, that's great. Thank you. Are both of the scripts done? [14:22:21] Great, thank you [14:25:03] (03PS2) 10Marostegui: db-eqiad.php: Add db1096 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373543 (https://phabricator.wikimedia.org/T172679) [14:25:35] (03PS1) 10Volans: Cumin: update configuration file for v1.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/373550 [14:25:42] Can I deploy ^ phuedx? 
[14:26:19] Urbanecm: those scripts are done, sorry it took so long [14:26:25] Amir1: you now have the window [14:26:52] Thanks [14:27:13] marostegui: i'll defer to zeljkof and Amir1 on that one -- i know that Amir1 's trying to get out a hotfix to stop the job queue from increasing in size [14:27:20] ok [14:27:27] but i actually have to run in 5 minutes -- i wasn't planning on taking this long [14:28:08] phuedx, thank you! [14:28:17] marostegui: phuedx was running SWAT today, if Amir1 is done, go ahead [14:28:39] ^ thanks y'all [14:28:48] that wasn't as smooth as i'd have hoped [14:28:51] finally [14:28:51] https://gerrit.wikimedia.org/r/#/c/373551/ [14:29:05] I got it to the branch after fourth try [14:29:49] o/ [14:41:44] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: lists.wikimedia.org (208.80.154.21) blocked by Trend Micro - https://phabricator.wikimedia.org/T172602#3548860 (10herron) 05Open>03Resolved a:03herron 208.80.154.21 is no longer listed here. Considering this task resolved. @Platonides thanks for report... 
[14:44:17] (03PS1) 10Rush: openstack: remove inactive uwsgi cruft for keystone [puppet] - 10https://gerrit.wikimedia.org/r/373553 (https://phabricator.wikimedia.org/T171494) [14:44:43] !log ladsgroup@tin Synchronized php-1.30.0-wmf.15/extensions/Wikidata/extensions/Wikibase/client/includes/Changes/WikiPageUpdater.php: Reduce batch size in WikiPageUpdater (T173710) (duration: 00m 48s) [14:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:55] T173710: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710 [14:45:18] marostegui: My deploy is done now, waiting to check if anything explodes [14:48:37] (03PS1) 10Alexandros Kosiaris: WIP: Upgrade to kubernetes 1.7.4 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/373554 (https://phabricator.wikimedia.org/T170119) [14:49:10] (03CR) 10Alexandros Kosiaris: [C: 031] swift: don't track connections to swift backend services on frontend machines [puppet] - 10https://gerrit.wikimedia.org/r/373039 (https://phabricator.wikimedia.org/T173731) (owner: 10Filippo Giunchedi) [14:49:39] (03PS4) 10Filippo Giunchedi: swift: don't track connections to swift backend services on frontend machines [puppet] - 10https://gerrit.wikimedia.org/r/373039 (https://phabricator.wikimedia.org/T173731) [14:50:22] (03CR) 10Filippo Giunchedi: [C: 032] swift: don't track connections to swift backend services on frontend machines [puppet] - 10https://gerrit.wikimedia.org/r/373039 (https://phabricator.wikimedia.org/T173731) (owner: 10Filippo Giunchedi) [14:51:53] (03CR) 10Andrew Bogott: [C: 031] openstack: remove inactive uwsgi cruft for keystone [puppet] - 10https://gerrit.wikimedia.org/r/373553 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:53:53] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:54:00] gah that's me [14:54:05] ack [14:54:07] PROBLEM - Swift HTTP backend on ms-fe1005 is CRITICAL: HTTP CRITICAL: 
HTTP/1.1 503 Service Unavailable - 351 bytes in 3.015 second response time [14:54:16] cool [14:54:17] sorry about that [14:54:21] np [14:54:40] I'm out, I'm online in telegram if anything happens [14:54:52] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 173 bytes in 0.003 second response time [14:54:55] I prefer 1000 times more maintenance than an outage :-) [14:55:00] (03PS2) 10Rush: openstack: remove inactive uwsgi cruft for keystone [puppet] - 10https://gerrit.wikimedia.org/r/373553 (https://phabricator.wikimedia.org/T171494) [14:55:04] heheh indeed [14:55:17] PROBLEM - Swift HTTP backend on ms-fe1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 351 bytes in 3.015 second response time [14:55:58] I'm chalking that one up to maint too^ :) [14:55:59] atm [14:56:26] yeah [14:58:17] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:59:37] ok I'll revert that change [15:00:08] RECOVERY - Swift HTTP backend on ms-fe1005 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.023 second response time [15:00:35] (03PS1) 10Filippo Giunchedi: Revert "swift: don't track connections to swift backend services on frontend machines" [puppet] - 10https://gerrit.wikimedia.org/r/373558 [15:00:57] (03CR) 10jerkins-bot: [V: 04-1] Revert "swift: don't track connections to swift backend services on frontend machines" [puppet] - 10https://gerrit.wikimedia.org/r/373558 (owner: 10Filippo Giunchedi) [15:01:41] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Revert "swift: don't track connections to swift backend services on frontend machines" [puppet] - 10https://gerrit.wikimedia.org/r/373558 (owner: 10Filippo Giunchedi) [15:02:03] 15:00:54 Line 1: First line should be <=80 characters [15:02:06] ppffftt [15:03:17] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:05:18] RECOVERY - Swift HTTP backend on 
ms-fe1007 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.014 second response time [15:05:37] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Redirect nan.wikipedia.org to zh-min-nan.wikipedia.org - https://phabricator.wikimedia.org/T173966#3546343 (10Liuxinyu970226) @Verdy_p isn't this domain already redirected for years? In [[ https://phabricator.wikimedia.org/source/operations-... [15:11:11] 10Operations, 10monitoring: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#1562129 (10herron) For servers the ipmi sensor check used for monitoring temperature could also be used to monitor additional sensors like power supplies. ~# /usr/loc... [15:12:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add db1096 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373543 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [15:15:29] (03Merged) 10jenkins-bot: db-eqiad.php: Add db1096 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373543 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [15:16:24] (03CR) 10jenkins-bot: db-eqiad.php: Add db1096 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373543 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [15:17:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db1096 depooled to s5 - T172679 (duration: 00m 47s) [15:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:50] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [15:18:36] (03CR) 10Dbarratt: [C: 031] Enable wgEchoPerUserBlacklist at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373133 (https://phabricator.wikimedia.org/T173838) (owner: 10Urbanecm) [15:25:05] (03PS1) 10Andrew Bogott: add interface::add_ip6_mapped to californium [puppet] - 10https://gerrit.wikimedia.org/r/373564 [15:25:43] (03CR) 10Andrew 
Bogott: [C: 032] add interface::add_ip6_mapped to californium [puppet] - 10https://gerrit.wikimedia.org/r/373564 (owner: 10Andrew Bogott) [15:27:50] (03PS3) 10Rush: openstack: remove inactive uwsgi cruft for keystone [puppet] - 10https://gerrit.wikimedia.org/r/373553 (https://phabricator.wikimedia.org/T171494) [15:28:55] (03CR) 10Rush: [C: 032] openstack: remove inactive uwsgi cruft for keystone [puppet] - 10https://gerrit.wikimedia.org/r/373553 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:31:57] 10Operations, 10monitoring, 10Patch-For-Review: Icinga disk space check should also check inode usage - https://phabricator.wikimedia.org/T129222#3549212 (10herron) [15:31:59] 10Operations, 10Icinga: Make nagios check_disk check for inode usage as well - https://phabricator.wikimedia.org/T84171#3549211 (10herron) [15:32:48] 10Operations, 10Icinga: Make nagios check_disk check for inode usage as well - https://phabricator.wikimedia.org/T84171#923686 (10herron) 05Open>03Resolved a:03herron [15:42:34] (03PS1) 10Marostegui: db-eqiad.php: Give some weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373569 (https://phabricator.wikimedia.org/T172679) [15:42:53] (03CR) 10Marostegui: [C: 04-2] "server still catching up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373569 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [15:44:16] 10Operations, 10ops-eqiad, 10netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3549254 (10ayounsi) As we struggled a bit in codfw to figure out which optics/links were working on the SRX/EX, here is what is currently there: pfw3a/pfw3b ``` FPC 0 REV 11 711-0... 
[15:50:03] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: Redirect nan.wikipedia.org to zh-min-nan.wikipedia.org - https://phabricator.wikimedia.org/T173966#3549273 (10Verdy_p) I did '''not''' made that request, it was there since years but not triaged like other pending renames all related to the... [15:51:56] (03PS2) 10Marostegui: db-eqiad.php: Give some weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373569 (https://phabricator.wikimedia.org/T172679) [15:54:05] moritzm: (non urgent, but if you are still working…) I'm wondering if we can make the ferm puppet classes automatically duplicate ipv4 firewall rules for ipv6. Is that something you've thought about? [15:54:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Give some weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373569 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [15:55:59] (03PS2) 10Volans: Cumin: update configuration file for v1.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/373550 [15:57:09] (03Merged) 10jenkins-bot: db-eqiad.php: Give some weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373569 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [15:57:19] (03CR) 10jenkins-bot: db-eqiad.php: Give some weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373569 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [15:57:37] PROBLEM - keystone admin endpoint on labcontrol1002 is CRITICAL: connect to address 208.80.154.95 and port 35357: Connection refused [15:58:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give some weight to db1096 - T172679 (duration: 00m 47s) [15:58:28] PROBLEM - keystone public endoint on labcontrol1002 is CRITICAL: connect to address 208.80.154.95 and port 5000: Connection refused [15:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:34] T172679: Productionize 11 new eqiad 
database servers - https://phabricator.wikimedia.org/T172679 [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170824T1600). Please do the needful. [16:01:36] (03PS3) 10Volans: Cumin: update configuration file for v1.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/373550 [16:01:43] (03PS1) 10Marostegui: db-eqiad.php: Add weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373594 (https://phabricator.wikimedia.org/T172679) [16:03:26] (03PS1) 10Matthias Mullie: Add missing THREED2PNG_PATH [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/373595 [16:05:37] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2004930 [16:09:04] (03PS4) 10Jcrespo: mariadb: reimage db1069 as a core host, remove sanitarium old role [puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) [16:09:24] (03PS4) 10Volans: Cumin: update configuration file for v1.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/373550 [16:09:45] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3549359 (10daniel) Now let's see what the reduced batch size does. It may actually make the problem worse, but increasing the total number of jobs. Let's... 
[16:10:20] (03CR) 10Volans: "Puppet compiler result available here: https://puppet-compiler.wmflabs.org/compiler02/7595/" [puppet] - 10https://gerrit.wikimedia.org/r/373550 (owner: 10Volans) [16:13:13] (03CR) 10Jcrespo: [C: 032] mariadb: reimage db1069 as a core host, remove sanitarium old role [puppet] - 10https://gerrit.wikimedia.org/r/373311 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [16:15:47] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] [16:17:47] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] [16:19:16] (03CR) 10Dzahn: "Paladox, have that link?" [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [16:19:27] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3549410 (10jcrespo) It is my intention to reimage db1069, provisioning it from db1033 (s7) and pool it as a db1028 replacement, making both db1033 and db1028 obsolete (to be retire... [16:20:29] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373594 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [16:23:32] re zuul alert: [16:23:41] yep, looks like a big backlog of patches to review but nothing broken afaict: https://grafana.wikimedia.org/dashboard/db/nodepool?orgId=1 [16:23:51] (what I said in response to it in -releng) [16:24:10] !log apt-get upgrade and updated to MW 1.29.1 on wikitech-static. Rebooting to pick up kernel updates. 
[16:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:02] (03CR) 10Dzahn: [C: 04-1] "this is another one for the Gerrit deb file, but we are not going to use the deb anymore..." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 (owner: 10Paladox) [16:27:04] (03CR) 10Paladox: "Hi, yep, https://github.com/GerritCodeReview/gerrit/blob/master/gerrit-server/src/main/resources/com/google/gerrit/server/mail/Comment.soy" [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [16:27:06] (03PS7) 10Paladox: Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) [16:27:47] (03Abandoned) 10Paladox: Upgrade gerrit to 2.14.2-pre (DO NOT MERGE) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 (owner: 10Paladox) [16:29:33] (03PS8) 10Paladox: Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 [16:29:39] (03PS6) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [16:30:01] (03CR) 10Dzahn: [C: 04-1] "@Paladox and the last one for operations/debs/gerrit (we won't be using the deb anymore)" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [16:31:40] (03Abandoned) 10Paladox: Add mariadb-java-client [debs/gerrit] - 10https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [16:31:45] (03PS1) 10Rush: openstack: openstack2/keystone/monitor move to profile base [puppet] - 10https://gerrit.wikimedia.org/r/373598 (https://phabricator.wikimedia.org/T171494) [16:31:53] (03PS2) 10Rush: openstack: openstack2/keystone/monitor move to profile base [puppet] - 10https://gerrit.wikimedia.org/r/373598 (https://phabricator.wikimedia.org/T171494) [16:31:58] (03CR) 10jerkins-bot: [V: 04-1] 
openstack: openstack2/keystone/monitor move to profile base [puppet] - 10https://gerrit.wikimedia.org/r/373598 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:33:22] Almost 40 minutes to get a change thru :( [16:33:49] (03Merged) 10jenkins-bot: db-eqiad.php: Add weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373594 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [16:34:48] !log T169939: Decommission Cassandra: restbase2003-a.codfw.wmnet [16:34:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give some more weight to db1096 - T172679 (duration: 00m 47s) [16:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:01] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [16:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:12] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [16:35:39] 10Operations, 10cloud-services-team: Time to renew *.wmflabs.org cert - https://phabricator.wikimedia.org/T174053#3549484 (10Andrew) [16:35:47] (03PS1) 10Marostegui: db-eqiad.php: Give more weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373599 [16:36:23] (03CR) 10jenkins-bot: db-eqiad.php: Add weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373594 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [16:38:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Give more weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373599 (owner: 10Marostegui) [16:40:51] (03Merged) 10jenkins-bot: db-eqiad.php: Give more weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373599 (owner: 10Marostegui) [16:41:01] (03CR) 10jenkins-bot: db-eqiad.php: Give more weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373599 (owner: 10Marostegui) 
[16:42:28] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549499 (10jcrespo) 05Open>03Resolved a:03Cmjohnson > Should we close this ticket and create a new one for testing another host and see its behaviour? Let's just do it. [16:43:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give some more weight to db1096 - T172679 (duration: 00m 47s) [16:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:16] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [16:45:46] (03PS1) 10Marostegui: db-eqiad.php: Give normal weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373602 [16:47:15] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549532 (10Cmjohnson) Let me know which db you want to test and when. [16:48:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549544 (10jcrespo) Let us take some time to find a good candidate and create some fake load, and we will ping either you or Papaul on T174054. [16:49:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Give normal weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373602 (owner: 10Marostegui) [16:51:48] (03Merged) 10jenkins-bot: db-eqiad.php: Give normal weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373602 (owner: 10Marostegui) [16:52:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give normal weight to db1096 - T172679 (duration: 00m 47s) [16:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:08] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. 
Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170824T1700). [17:00:24] arlo will do a parsoid deploy in a little while [17:01:07] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [17:01:43] !log stopping and cloning db1033 to db1069 [17:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:15] (03CR) 10Rush: [C: 032] openstack: openstack2/keystone/monitor move to profile base [puppet] - 10https://gerrit.wikimedia.org/r/373598 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:04:58] (03CR) 10jenkins-bot: db-eqiad.php: Give normal weight to db1096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373602 (owner: 10Marostegui) [17:21:21] !log arlolra@tin Started deploy [parsoid/deploy@bd12f8a]: Updating Parsoid to 538dad7f [17:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:10] RECOVERY - DPKG on labstore2001 is OK: All packages OK [17:29:29] RECOVERY - dhclient process on labstore2001 is OK: PROCS OK: 0 processes with command name dhclient [17:29:39] RECOVERY - Check systemd state on labstore2001 is OK: OK - running: The system is fully operational [17:29:39] RECOVERY - Check the NTP synchronisation status of timesyncd on labstore2001 is OK: OK: synced at Thu 2017-08-24 17:29:31 UTC. 
[17:29:49] RECOVERY - configured eth on labstore2001 is OK: OK - interfaces up [17:29:49] RECOVERY - Disk space on labstore2001 is OK: DISK OK [17:29:59] RECOVERY - IPMI Temperature on labstore2001 is OK: Sensor Type(s) Temperature Status: OK [17:30:00] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:33:45] !log arlolra@tin Finished deploy [parsoid/deploy@bd12f8a]: Updating Parsoid to 538dad7f (duration: 12m 24s) [17:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:20] RECOVERY - salt-minion processes on labstore2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:34:39] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2037146 [17:35:39] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 18 [17:37:51] 10Operations, 10Mail, 10OTRS, 10Patch-For-Review: Automatically merge bounces/DSNs in ticket - https://phabricator.wikimedia.org/T173733#3538151 (10pajz) Did not work in https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=10208291, despite the header being set. Hm. [17:38:29] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3549730 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by madhuvishy on neodymium.eqiad.wmnet for hosts: ```... [17:38:39] RECOVERY - MegaRAID on labstore2001 is OK: OK: optimal, 12 logical, 12 physical [17:39:07] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3549733 (10EBernhardson) >>! In T173710#3547826, @Ladsgroup wrote: > The jobqueue has slowed down but still increasing, and cirrusSearchIncomingLinkCount... 
[17:42:45] the jobqueue is still exploding, maybe I should just wait, as it's "rush hour" on Wikipedia? [17:52:40] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 25 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:52:49] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.69 seconds [17:52:49] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 420.97 seconds [17:54:18] Amir1: enwiki is in constant rush hour in my opinion, so if it's affecting it I say no [17:55:52] Zppix: please let others who know more about the load on our dbs/jobrunners comment [17:56:23] greg-g ack [17:57:40] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:59:49] number of enqueuing jobs is way higher than at noon (UTC time); it was around 800/sec back then, it's 1200/sec right now [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170824T1800). 
[18:04:50] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 55.82 seconds [18:24:45] !log T169939: Decommission Cassandra: restbase2003-b.codfw.wmnet [18:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:57] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [18:31:42] (03PS30) 10Ppchelko: Increase max kafka message size for changeprop and kafka main [puppet] - 10https://gerrit.wikimedia.org/r/372179 [18:37:36] the jobqueue has started to decrease, and it's decreasing drastically [18:38:11] Amir1: that's me, testing the effects of lower throttling against viwiki htmlCacheUpdate jobs [18:38:18] !log varnish backend restart - cp1074 - mailbox lag [18:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:42] ebernhardson: thanks, keep me updated [18:39:31] (03PS31) 10Ppchelko: Increase max kafka message size for changeprop and kafka main [puppet] - 10https://gerrit.wikimedia.org/r/372179 [18:40:04] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Epic, and 3 others: FY2017/18 Program 6 - Outcome 2 - Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3550012 (10dbarratt) [18:44:39] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170824T1900). Please do the needful. 
[19:00:35] * thcipriani does needful [19:03:39] (03PS2) 10Jdlrobson: pagePreviews: remove invalidated popup sampling rate variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373171 (https://phabricator.wikimedia.org/T174075) (owner: 10Niedzielski) [19:03:49] (03CR) 10jerkins-bot: [V: 04-1] pagePreviews: remove invalidated popup sampling rate variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373171 (https://phabricator.wikimedia.org/T174075) (owner: 10Niedzielski) [19:04:11] (03CR) 10Jdlrobson: [C: 031] "WMF13 is live everywhere, so we can feel free to SWAT this whenever we need to" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373171 (https://phabricator.wikimedia.org/T174075) (owner: 10Niedzielski) [19:05:55] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3550256 (10Eevans) [19:07:53] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3550263 (10aaron) >>! In T173710#3548223, @daniel wrote: >>>! In T173710#3547580, @aaron wrote: >> In other words, base jobs for entities that will divid... 
[19:10:50] (03PS1) 10Thcipriani: all wikis to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373669 [19:10:52] (03CR) 10Thcipriani: [C: 032] all wikis to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373669 (owner: 10Thcipriani) [19:11:26] (03PS1) 10BBlack: dhparam: use ffdhe2048 from RFC7919 [puppet] - 10https://gerrit.wikimedia.org/r/373670 [19:12:16] (03CR) 10BBlack: [C: 032] dhparam: use ffdhe2048 from RFC7919 [puppet] - 10https://gerrit.wikimedia.org/r/373670 (owner: 10BBlack) [19:13:42] 10Operations, 10DBA: decomission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3550282 (10jcrespo) [19:14:43] !log restarting rabbitmq-server on labcontrol1001 [19:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:15] (03PS11) 10MarcoAurelio: Initial configuration for hi.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371109 (https://phabricator.wikimedia.org/T173013) [19:19:30] (03PS8) 10Jcrespo: mariadb: Adding rack allocations, some formatting fixes, read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371444 (https://phabricator.wikimedia.org/T172459) [19:19:32] (03PS2) 10Jcrespo: mariadb: Repool es2013 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373524 (https://phabricator.wikimedia.org/T172265) [19:19:34] (03PS1) 10Jcrespo: mariadb: Retire db1033, pool db1069 with its copy and low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373673 (https://phabricator.wikimedia.org/T174076) [19:21:50] (03PS2) 10Jcrespo: mariadb: Retire db1033, pool db1069 with its copy and low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373673 (https://phabricator.wikimedia.org/T174076) [19:23:26] (03Merged) 10jenkins-bot: all wikis to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373669 (owner: 10Thcipriani) [19:23:59] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag 
Replication lag: 21.26 seconds [19:24:19] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.30.0-wmf.15 [19:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:07] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3550329 (10jcrespo) [19:26:39] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3550331 (10Cmjohnson) return shipping information USPS 9202 3946 5301 2436 4467 08 FDX 9611918 2393026 73196722 [19:27:25] (03CR) 10jenkins-bot: all wikis to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373669 (owner: 10Thcipriani) [19:27:55] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3550377 (10Cmjohnson) Return shipping info for disk UPS 1ZW0948Y9082750467 [19:33:20] (03PS1) 10Jcrespo: mariadb: Remove db1069 from labs prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/373674 (https://phabricator.wikimedia.org/T169514) [19:33:49] (03CR) 10Jcrespo: [C: 032] mariadb: Remove db1069 from labs prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/373674 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [19:38:00] 10Operations, 10ops-eqiad, 10netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3550424 (10Cmjohnson) All the switches and pfw's have been racked and racktables updated. both pfw3a and pfw3b have been connected to the scs console and msw switch. Both frasw1 and frasw2 ha... 
[19:40:07] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3550443 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labstore2002.codfw.wmnet'] ``` Of which those **FAI... [19:42:22] andrewbogott: looks like there are nodepool launch errors and delete timeouts. Initially when you kicked rabbitmq it looked like things started moving, now getting backed up again :( [19:44:43] 10Operations, 10Mail: mail.wikimedia.org SSL cert expiring Mon 23 Oct 2017 - https://phabricator.wikimedia.org/T174081#3550469 (10herron) [19:56:20] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [19:58:19] ^ known, andrewbogott is cleaning things up [19:58:51] !log T169939: Decommission Cassandra: restbase2003-c.codfw.wmnet [19:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:03] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [19:59:20] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [20:02:55] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3550574 (10EBernhardson) I've been doing some testing with the HTMLCacheUpdate jobs against relatively low traffic wikis with high numbers of jobs (viwki... [20:07:23] !log stopping, starting, restarting etc. nodepool [20:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:53] ok, re-enabled puppet on install1002, I need a partman break, going to go buy some lunch. 
[20:27:55] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3550742 (10EBernhardson) Some background from bblack about the cache purge pipeline: ``` A) Sometime in the distant past, the way it worked is that when... [20:30:43] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3550759 (10Jdforrester-WMF) Well, it's dropped by ~1.5M jobs in the last couple of hours and seems to be now more slowly draining the pool. [20:36:41] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3550786 (10EBernhardson) Related previous tickets about purge problems/increases: T124418 T133821 [20:40:18] (03PS1) 10Rush: openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) [20:40:33] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3550809 (10EBernhardson) >>! In T173710#3550759, @Jdforrester-WMF wrote: > Well, it's dropped by ~1.5M jobs in the last couple of hours and seems to be n... 
[20:40:34] (03CR) 10jerkins-bot: [V: 04-1] openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:40:52] (03PS2) 10Rush: openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) [20:41:15] (03CR) 10jerkins-bot: [V: 04-1] openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:42:00] PROBLEM - puppet last run on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:19] PROBLEM - traffic-pool service on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:29] PROBLEM - confd service on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:39] PROBLEM - SSH on cp1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:46:59] PROBLEM - MD RAID on cp1063 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:47:09] (03PS3) 10Rush: openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) [20:47:23] (03CR) 10jerkins-bot: [V: 04-1] openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:48:01] (03PS5) 10Rush: openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) [20:48:25] (03CR) 10jerkins-bot: [V: 04-1] openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:50:09] (03PS6) 10Rush: openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) [20:50:43] (03PS1) 10Ottomata: Set discovery-stats user primary group to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/373686 [20:50:49] RECOVERY - SSH on cp1063 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [20:50:59] RECOVERY - MD RAID on cp1063 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [20:51:09] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures [20:51:19] RECOVERY - traffic-pool service on cp1063 is OK: OK - traffic-pool is active [20:51:29] RECOVERY - confd service on cp1063 is OK: OK - confd is active [20:51:39] (03CR) 10Ottomata: [C: 032] Set discovery-stats user primary group to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/373686 (owner: 10Ottomata) [20:51:51] (03PS2) 10Ottomata: Set discovery-stats user primary group to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/373686 [20:51:53] (03CR) 10Ottomata: [V: 032 C: 032] Set discovery-stats user primary group to 
analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/373686 (owner: 10Ottomata) [20:53:31] (03PS1) 10Dzahn: icinga: enhance check for screen sessions, also detect tmux [puppet] - 10https://gerrit.wikimedia.org/r/373687 (https://phabricator.wikimedia.org/T165348) [21:00:05] MaxSem and Niharika: Dear anthropoid, the time has come. Please deploy Community Tech breaking the site (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170824T2100). [21:00:22] Hi jouncebot. [21:00:37] breaking what? [21:00:38] lol [21:00:56] (03PS1) 10Ottomata: Include discovery-stats user in analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/373689 [21:01:28] (03PS2) 10Ottomata: Include discovery-stats user in analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/373689 [21:01:51] (03CR) 10jerkins-bot: [V: 04-1] Include discovery-stats user in analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/373689 (owner: 10Ottomata) [21:02:09] Niharika: 10/10 deployment window name [21:02:23] Zppix: All MaxSem! [21:02:45] All windows should be variants of that [21:03:29] (03PS2) 10Dzahn: icinga: enhance check for screen sessions, also detect tmux [puppet] - 10https://gerrit.wikimedia.org/r/373687 (https://phabricator.wikimedia.org/T165348) [21:03:55] I prefer the "scap" = "scattering crap across production" by bd808 :P [21:04:18] TabbyCat: isn't that what scap does though? 
[21:04:35] yes when it deploys Flow to wikis [21:04:45] (03PS3) 10Ottomata: Include discovery-stats user in analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/373689 [21:04:49] ;) [21:05:07] (03CR) 10jerkins-bot: [V: 04-1] Include discovery-stats user in analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/373689 (owner: 10Ottomata) [21:05:27] thcipriani: (if you're around) Zuul looks partially dead https://integration.wikimedia.org/zuul/ [21:06:04] Niharika: it's recovering [21:06:09] Slowly but surely [21:06:12] (03PS4) 10Ottomata: Include discovery-stats user in analytics_cluster::users [puppet] - 10https://gerrit.wikimedia.org/r/373689 [21:06:15] Niharika: we had some nodepool trouble earlier, it should be recovering now :) [21:06:30] I hope we don't miss our deployment window. [21:07:05] It appears to be going thru the queue at a decent speed [21:07:33] moar CI boxes plz [21:08:40] That Zuul page doesn't auto-update anymore. [21:09:22] thcipriani: does zuul no longer auto refresh? [21:09:28] it's being shy, too many eyes on it [21:09:35] I noticed that earlier, tracked here: https://phabricator.wikimedia.org/T174058 [21:10:01] MaxSem: "I've got my eyes on you" [21:13:56] !log maxsem@tin Synchronized php-1.30.0-wmf.15/extensions/LoginNotify/: Logging: https://gerrit.wikimedia.org/r/#/c/373691/ (duration: 00m 44s) [21:14:07] (03CR) 10Ottomata: [C: 04-2] "This needs more thought and a ticket." [puppet] - 10https://gerrit.wikimedia.org/r/373689 (owner: 10Ottomata) [21:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:14] (03PS1) 10Urbanecm: Remove non-transparent background from dty.wiki logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373694 (https://phabricator.wikimedia.org/T174098) [21:29:19] Urbanecm: does https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/static/images/project-logos/dtywiki-1.5x.png look a bit blurry to you? 
[21:29:30] otherwise lgtm [21:29:58] TabbyCat: Yes. I regenerated the images from a SVG, it looks okay in my patch [21:30:15] (03PS1) 10Urbanecm: throttle.php: Separate the throttling definitions from the exception values itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373695 (https://phabricator.wikimedia.org/T167040) [21:31:44] (03CR) 10jerkins-bot: [V: 04-1] throttle.php: Separate the throttling definitions from the exception values itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373695 (https://phabricator.wikimedia.org/T167040) (owner: 10Urbanecm) [21:31:52] (03PS1) 10Pmiazga: Fix incorrect Special:Userlogin name in Popups blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373696 (https://phabricator.wikimedia.org/T170169) [21:32:25] (03CR) 10MarcoAurelio: [C: 031] "I have viewed the new logos and they look good to me. Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373694 (https://phabricator.wikimedia.org/T174098) (owner: 10Urbanecm) [21:32:38] (03PS2) 10Urbanecm: throttle.php: Separate the throttling definitions from the exception values itself [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373695 (https://phabricator.wikimedia.org/T167040) [21:34:09] (03PS6) 10MarcoAurelio: Cloud VPS configuration for hi.wikivoyage [puppet] - 10https://gerrit.wikimedia.org/r/371096 (https://phabricator.wikimedia.org/T173013) [21:35:36] (03CR) 10Rush: [C: 032] openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:35:40] (03PS7) 10Rush: openstack: misc components to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/373685 (https://phabricator.wikimedia.org/T171494) [21:40:29] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:41:44] (03PS1) 10Urbanecm: Automatically include commons and wikidata in $wmgThrottlingExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373698 (https://phabricator.wikimedia.org/T163872) [21:43:19] (03PS2) 10Urbanecm: Automatically include commons and wikidata in $wmgThrottlingExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373698 (https://phabricator.wikimedia.org/T163872) [21:43:31] (03PS3) 10Urbanecm: Automatically include commons and wikidata in $wmgThrottlingExceptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373698 (https://phabricator.wikimedia.org/T163872) [21:43:59] (03PS1) 10Rush: openstack: fix labtest wmflabsorg-domainadminenv.sh.erb [puppet] - 10https://gerrit.wikimedia.org/r/373700 (https://phabricator.wikimedia.org/T171494) [21:44:06] 10Operations, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3551109 (10cwdent) a:03cwdent Hi @fgiunchedi - prometheus is now running on pay-lvs*:9090 They are only watching themselves and one eqiad host, which also has the mysqld exporter. Wondering... 
[21:44:16] (PS2) Rush: openstack: fix labtest wmflabsorg-domainadminenv.sh.erb [puppet] - https://gerrit.wikimedia.org/r/373700 (https://phabricator.wikimedia.org/T171494) [21:45:06] (CR) Rush: [C: 2] openstack: fix labtest wmflabsorg-domainadminenv.sh.erb [puppet] - https://gerrit.wikimedia.org/r/373700 (https://phabricator.wikimedia.org/T171494) (owner: Rush) [21:45:43] (CR) jerkins-bot: [V: -1] Automatically include commons and wikidata in $wmgThrottlingExceptions [mediawiki-config] - https://gerrit.wikimedia.org/r/373698 (https://phabricator.wikimedia.org/T163872) (owner: Urbanecm) [21:47:30] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:47:44] (CR) Volans: [C: -1] "There seems to be a spurious file in the CR, looks good otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/373509 (owner: Gehel) [21:51:19] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:52:19] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [21:54:53] !log maxsem@tin Synchronized php-1.30.0-wmf.15/extensions/LoginNotify/: https://gerrit.wikimedia.org/r/#/c/373701/1 (duration: 00m 45s) [21:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:51] (CR) Volans: [C: -1] "Some typos, see inline. Looks good otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/373510 (owner: Gehel) [21:57:54] (CR) Volans: [C: -1] "Some typos in the variable names.
I'm not sure if we should keep the variables snake-cased in the definition of logrotate::rule while the " (2 comments) [puppet] - https://gerrit.wikimedia.org/r/373515 (owner: Gehel) [21:59:54] !log restarted pdfrender service instances in eqiad / T159922 [22:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:06] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922 [22:14:55] Operations, CirrusSearch, Discovery, Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3551230 (EBernhardson) Note necessarily a cause, but while looking into viwiki's backlog, i noticed this bot which seems to be creating an incredible n... [22:17:22] Operations, CirrusSearch, Discovery, Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3551232 (aaron) >>! In T173710#3551156, @aaron wrote: > Secondary purges where for dealing with replication lag scenarios, not lost purges. That was on... [22:20:34] (CR) Bearloga: "> This needs more thought and a ticket." [puppet] - https://gerrit.wikimedia.org/r/373689 (owner: Ottomata) [22:20:51] (PS5) Bearloga: Include discovery-stats user in analytics_cluster::users [puppet] - https://gerrit.wikimedia.org/r/373689 (https://phabricator.wikimedia.org/T174110) (owner: Ottomata) [22:45:00] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.13 [keeping static files] (duration: 02m 01s) [22:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:05] Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3551301 (ops-monitoring-bot) Script wmf_auto_reimage was launched by madhuvishy on neodymium.eqiad.wmnet for hosts: ```...
[23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170824T2300). [23:00:05] brion: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:16] Yo yo [23:00:21] ma [23:00:27] :) [23:00:35] Operations, DNS, Traffic, Wikimedia-Apache-configuration: Redirect nan.wikipedia.org to zh-min-nan.wikipedia.org - https://phabricator.wikimedia.org/T173966#3551325 (Liuxinyu970226) OK, but in any case, if you're asking nan.wikipedia.org, that's already redirected and this task should be invalid.... [23:01:12] I can SWAT [23:02:33] \o/ [23:03:46] hopefully jenkins is less clogged now :D [23:04:25] it's back to chugging away (more or less) happily [23:07:16] (PS1) Andrew Bogott: labs puppetmaster: get allowed servers from hiera rather than hard-coding [puppet] - https://gerrit.wikimedia.org/r/373710 [23:09:52] (PS2) Andrew Bogott: labs puppetmaster: get allowed servers from hiera rather than hard-coding [puppet] - https://gerrit.wikimedia.org/r/373710 [23:12:16] (CR) Andrew Bogott: [C: 2] labs puppetmaster: get allowed servers from hiera rather than hard-coding [puppet] - https://gerrit.wikimedia.org/r/373710 (owner: Andrew Bogott) [23:13:21] yay jenkins did his thing [23:14:17] brion: change is live on mwdebug1002, check please [23:14:56] thcipriani: looks good [23:15:03] cool, going live [23:15:07] \o/ woot [23:17:00] !log thcipriani@tin Synchronized php-1.30.0-wmf.15/extensions/TimedMediaHandler/TimedMediaHandler.php: SWAT: [[gerrit:373692|Disable Ogg Theora video transcodes in default config]] T172445 (duration: 00m 45s) [23:17:07] ^ brion live everywhere [23:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:12] T172445: Deprecate/remove Ogg Theora video output formats in favor of WebM - https://phabricator.wikimedia.org/T172445 [23:17:17] thcipriani: looks good, thanks! [23:17:31] awesome. yw :) [23:19:21] Operations, Multimedia, TimedMediaHandler, HHVM, Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#3551406 (brion) [23:34:23] Operations, CirrusSearch, Discovery, Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3551447 (EBernhardson) Parsing through the existing jobs with a script to measure the "real" number of purges that will be issued: P5916 I'm finding th... [23:44:54] greg-g, I know this is super-late in the SWAT window, but can I deploy https://www.mediawiki.org/wiki/MediaWiki_1.30/Roadmap ? [23:45:11] greg-g, it's a small JS change, but it fixes a bad regression where Flow infinite scroll is totally broken. [23:45:57] matt_flaschen: deploy what? :) [23:47:17] greg-g, doh: https://gerrit.wikimedia.org/r/#/c/373713/ [23:47:53] matt_flaschen: +1 [23:48:22] greg-g, thanks [23:48:50] Operations, ops-eqiad, netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3551475 (ayounsi) @Cmjohnson, thanks. Can you rename frasw1 and frasw2 to fasw-c1a-eqiad and fasw-c1b-eqiad ? [23:48:53] Operations, DNS, Traffic, Wikimedia-Apache-configuration: Redirect nan.wikipedia.org to zh-min-nan.wikipedia.org - https://phabricator.wikimedia.org/T173966#3551476 (Verdy_p) All this was caused by changes in dependencies in a parent task that was closed by moving it elsewhere without fixing what...