[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170214T0000). [00:00:04] Jdlrobson and Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:22] * MaxSem can do it [00:01:19] Dereckson, your change appears ineligible for SWAT [00:01:31] jdlrobson, yt? [00:02:11] 06Operations, 10Deployment-Systems, 10Stashbot: [[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries - https://phabricator.wikimedia.org/T46791#3023722 (10demon) >>! In T46791#3023711, @bd808 wrote: > Some related thoughts/explanations on {T156079}. > > wm-bot already does too... [00:03:30] Yup MaxSem [00:05:14] (03PS3) 10MaxSem: Disable Hungarian Popups A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337064 (https://phabricator.wikimedia.org/T156290) (owner: 10Jdlrobson) [00:05:20] (03CR) 10MaxSem: [C: 032] Disable Hungarian Popups A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337064 (https://phabricator.wikimedia.org/T156290) (owner: 10Jdlrobson) [00:05:32] (03PS1) 10Dzahn: add missing wikimania2005.m wikimania2006.m mobile names [dns] - 10https://gerrit.wikimedia.org/r/337522 (https://phabricator.wikimedia.org/T152882) [00:06:53] (03Merged) 10jenkins-bot: Disable Hungarian Popups A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337064 (https://phabricator.wikimedia.org/T156290) (owner: 10Jdlrobson) [00:07:01] (03CR) 10jenkins-bot: Disable Hungarian Popups A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337064 (https://phabricator.wikimedia.org/T156290) (owner: 10Jdlrobson) [00:08:36] jdlrobson, pulled on mwdebug1002, please test [00:08:56] On it [00:12:25] (03PS2) 10MaxSem: Add Hindi Wikipedia wordmark 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/337441 (owner: 10Jdlrobson) [00:13:27] (03CR) 10Dzahn: [C: 031] new shell user Nithum Thain [puppet] - 10https://gerrit.wikimedia.org/r/337438 (owner: 10RobH) [00:13:46] thx =] [00:14:35] 06Operations, 10ops-codfw, 06DC-Ops, 10hardware-requests: decom install2001 - https://phabricator.wikimedia.org/T157840#3023765 (10Dzahn) [00:14:46] MaxSem: all good [00:16:15] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/337064/ (duration: 00m 48s) [00:16:17] jdlrobson, ^ [00:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:48] (03CR) 10MaxSem: [C: 032] Add Hindi Wikipedia wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337441 (owner: 10Jdlrobson) [00:16:53] 06Operations: decom carbon - https://phabricator.wikimedia.org/T158020#3023767 (10Dzahn) [00:18:17] (03Merged) 10jenkins-bot: Add Hindi Wikipedia wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337441 (owner: 10Jdlrobson) [00:18:26] (03CR) 10jenkins-bot: Add Hindi Wikipedia wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337441 (owner: 10Jdlrobson) [00:19:05] jdlrobson, pulled [00:19:53] Maxsem works [00:19:55] Thank you [00:20:10] ACKNOWLEDGEMENT - HTTP on carbon is CRITICAL: connect to address 208.80.154.10 and port 80: Connection refused daniel_zahn decom T158020 [00:20:10] ACKNOWLEDGEMENT - Squid on carbon is CRITICAL: connect to address 208.80.154.10 and port 8080: Connection refused daniel_zahn decom T158020 [00:21:08] !log maxsem@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-hi.svg: https://gerrit.wikimedia.org/r/#/c/337441/ (duration: 00m 42s) [00:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:51] 06Operations: decom carbon - https://phabricator.wikimedia.org/T158020#3023784 (10Dzahn) [00:22:03] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: 
https://gerrit.wikimedia.org/r/#/c/337441/ (duration: 00m 40s) [00:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:11] jdlrobson, ^ [00:23:00] (03PS2) 10MaxSem: Update Hebrew wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337442 (https://phabricator.wikimedia.org/T157863) (owner: 10Jdlrobson) [00:26:23] (03PS1) 10Dzahn: joe: move hosts file for carbon to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/337526 [00:26:43] 06Operations: decom carbon - https://phabricator.wikimedia.org/T158020#3023786 (10Dzahn) [00:31:17] jdlrobson, ... [00:33:06] ? [00:33:21] Hindi works great [00:33:28] Waiting to test Hebrew [00:33:37] (03CR) 10MaxSem: [C: 032] Update Hebrew wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337442 (https://phabricator.wikimedia.org/T157863) (owner: 10Jdlrobson) [00:33:52] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [00:34:57] (03Merged) 10jenkins-bot: Update Hebrew wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337442 (https://phabricator.wikimedia.org/T157863) (owner: 10Jdlrobson) [00:35:16] (03PS3) 10Dzahn: install: remove roles from carbon, demote to spare [puppet] - 10https://gerrit.wikimedia.org/r/337197 (https://phabricator.wikimedia.org/T158020) [00:35:24] jdlrobson, pulled [00:36:12] (03PS4) 10Dzahn: install: remove roles from carbon, demote to spare [puppet] - 10https://gerrit.wikimedia.org/r/337197 (https://phabricator.wikimedia.org/T158020) [00:36:23] (03CR) 10Dzahn: [C: 032] install: remove roles from carbon, demote to spare [puppet] - 10https://gerrit.wikimedia.org/r/337197 (https://phabricator.wikimedia.org/T158020) (owner: 10Dzahn) [00:36:45] (03CR) 10jenkins-bot: Update Hebrew wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337442 (https://phabricator.wikimedia.org/T157863) (owner: 10Jdlrobson) [00:36:56] MaxSem: that is 
good too. Thank you! [00:38:46] !log maxsem@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-he.svg: https://gerrit.wikimedia.org/r/#/c/337442/ (duration: 00m 40s) [00:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:53] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team, 06Wikipedia-Android-App-Backlog, and 2 others: Investigate how to improve Android CI performance and stability - https://phabricator.wikimedia.org/T158014#3023806 (10Niedzielski) I made an image locally with `android create avd -f -... [00:40:55] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/337442/ (duration: 00m 40s) [00:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:01] jdlrobson, ^ [00:41:26] MaxSem: big thumbs up. Thanks for doing swat today! [00:41:35] wee [00:42:13] 06Operations: make apt.wikimedia.org HA - https://phabricator.wikimedia.org/T158022#3023813 (10Dzahn) [00:42:54] 06Operations: make apt.wikimedia.org HA - https://phabricator.wikimedia.org/T158022#3023829 (10Dzahn) [00:43:33] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2853205 (10Dzahn) [00:46:44] (03PS1) 10Dzahn: install: correct spare role name for carbon [puppet] - 10https://gerrit.wikimedia.org/r/337529 [00:46:59] (03PS2) 10Dzahn: install: correct spare role name for carbon [puppet] - 10https://gerrit.wikimedia.org/r/337529 [00:48:20] (03CR) 10Dzahn: [C: 032] install: correct spare role name for carbon [puppet] - 10https://gerrit.wikimedia.org/r/337529 (owner: 10Dzahn) [00:49:15] (03PS1) 10Jforrester: Show 'Publish' not 'Save' on most public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337530 (https://phabricator.wikimedia.org/T131132) [00:49:16] (03PS1) 10Jforrester: Show 'Publish' not 'Save' on Wikipedias [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/337531 (https://phabricator.wikimedia.org/T131132) [00:49:30] 06Operations, 10Deployment-Systems, 10Stashbot: [[wikitech:Server_admin_log]] should not rely on freenode irc for logmsgbot entries - https://phabricator.wikimedia.org/T46791#3023858 (10demon) >>! In T46791#3023722, @demon wrote: > ** My `scap [log|sal]` suggestion as a fallback scenario also easily done as... [00:49:46] (03CR) 10Jforrester: [C: 04-2] "Provisionally scheduled for 2017-03-15." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337530 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [00:49:54] (03CR) 10Jforrester: [C: 04-2] "Provisionally scheduled for 2017-03-22." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337531 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [00:50:35] 06Operations: make apt.wikimedia.org HA - https://phabricator.wikimedia.org/T158022#3023861 (10Dzahn) [00:50:40] mutante, my brain keeps reading carbon as cabrón :P [00:51:11] 06Operations, 13Patch-For-Review: decom carbon - https://phabricator.wikimedia.org/T158020#3023876 (10Dzahn) [00:53:45] (03CR) 10Dzahn: [C: 032] "no-op everywhere http://puppet-compiler.wmflabs.org/5440/ (the 4 fails are known unrelated issues)" [puppet] - 10https://gerrit.wikimedia.org/r/337201 (owner: 10Dzahn) [00:53:54] MaxSem: me too, i typed it more than once :) [00:54:35] should rename it for the last week that it's up. 
hehe [00:54:49] (03PS3) 10Dzahn: lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - 10https://gerrit.wikimedia.org/r/337201 [00:55:34] 06Operations, 13Patch-For-Review: setup syslog server in codfw - https://phabricator.wikimedia.org/T138073#3023895 (10RobH) [00:55:36] 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#3023894 (10RobH) 05stalled>03Resolved [00:56:21] 06Operations, 10TimedMediaHandler: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#3023896 (10RobH) [00:58:51] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/5441/" [puppet] - 10https://gerrit.wikimedia.org/r/337202 (owner: 10Dzahn) [01:04:14] PROBLEM - puppet last run on labnodepool1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:07:10] 06Operations, 13Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#3023912 (10Dzahn) [01:07:13] 06Operations: Setting up a mirror serv{er,ice} - https://phabricator.wikimedia.org/T84817#3023913 (10Dzahn) [01:07:16] 06Operations, 13Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#3023908 (10Dzahn) 05Open>03Resolved all the things originally listed in this ticket have been done - except the "make APT HA" one which has been split out into T158022 d... 
[01:07:34] 06Operations: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#3023914 (10Dzahn) [01:07:48] (03Abandoned) 10Jforrester: Enable VisualEditor by default for all users of the Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292748 (https://phabricator.wikimedia.org/T136995) (owner: 10Jforrester) [01:07:59] (03Abandoned) 10Jforrester: Enable VisualEditor by default for all users of the Chinese Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292747 (https://phabricator.wikimedia.org/T136996) (owner: 10Jforrester) [01:13:24] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:22:01] 06Operations: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955#2990919 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/307665/ https://gerrit.wikimedia.org/r/#/c/306501/ [01:25:52] (03PS1) 10Dzahn: partman: delete raid1-lvm-ext4 recipe [puppet] - 10https://gerrit.wikimedia.org/r/337532 (https://phabricator.wikimedia.org/T156955) [01:26:07] (03CR) 10jerkins-bot: [V: 04-1] partman: delete raid1-lvm-ext4 recipe [puppet] - 10https://gerrit.wikimedia.org/r/337532 (https://phabricator.wikimedia.org/T156955) (owner: 10Dzahn) [01:32:14] RECOVERY - puppet last run on labnodepool1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [01:34:44] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1805.095231 Seconds [01:35:04] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1826.752716 Seconds [01:35:44] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 10.073504 Seconds [01:36:04] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 18.236611 Seconds [01:41:24] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 13 
seconds ago with 0 failures [02:35:03] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.11) (duration: 11m 50s) [02:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:22] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Feb 14 02:40:22 UTC 2017 (duration 5m 19s) [02:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:24:34] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 772.03 seconds [03:30:24] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:30:34] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 226.09 seconds [03:47:09] (03CR) 10NehalDaveND: "I forgot how to review. Can some one guide me for this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337402 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [03:54:55] (03CR) 10Dereckson: "You can write comments here like you did." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337402 (https://phabricator.wikimedia.org/T101634) (owner: 10Dereckson) [03:58:24] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [04:10:24] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:26:14] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1247.30 Read Requests/Sec=4523.50 Write Requests/Sec=490.00 KBytes Read/Sec=18130.00 KBytes_Written/Sec=7967.20 [04:38:14] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=8.80 Read Requests/Sec=1.00 Write Requests/Sec=0.80 KBytes Read/Sec=13.60 KBytes_Written/Sec=17.20 [04:39:24] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [05:09:54] PROBLEM - Disk space on thumbor1001 is CRITICAL: DISK CRITICAL - free space: /srv 15924 MB (3% inode=99%) [05:59:34] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:16:55] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:28:34] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:45:54] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:24] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [06:48:37] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#3024172 (10Marostegui) [06:48:39] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: db2060 not accessible - https://phabricator.wikimedia.org/T156161#3024170 (10Marostegui) 05Open>03Resolved I am going to close this for now, but will leave the server depooled for the next few days. If we see this happening again we'll reopen it Th... 
[06:48:44] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [06:49:04] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [06:56:16] !log Deploy alter table on x1 echo_notification tables - T136428 [06:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:22] T136428: Add primary key to echo_notification table - https://phabricator.wikimedia.org/T136428 [07:10:54] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:14:25] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:20:14] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2344 [07:22:54] RECOVERY - Disk space on thumbor1001 is OK: DISK OK [07:23:22] godog: truncated the same log file as yesterday (but on 1001 this time) 00^ [07:25:14] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 52295 Threads: 1 Questions: 1503784 Slow queries: 0 Opens: 619 Flush tables: 1 Open tables: 237 Queries per second avg: 28.755 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [07:25:52] <_joe_> elukey: thanks [07:28:24] (03CR) 10Giuseppe Lavagetto: [C: 032] stdlib: upgrade to 4.15.0 [puppet] - 10https://gerrit.wikimedia.org/r/336230 (owner: 10Giuseppe Lavagetto) [07:28:39] (03PS4) 10Giuseppe Lavagetto: stdlib: upgrade to 4.15.0 [puppet] - 10https://gerrit.wikimedia.org/r/336230 [07:28:43] <_joe_> I'm upgrading puppet stdlib now [07:29:07] <_joe_> elukey: it will change the logstash host some of the kafka instances will have in their config [07:29:13] <_joe_> from 1002 to 1003 [07:29:19] <_joe_> that should be ok, right? 
[07:29:23] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] stdlib: upgrade to 4.15.0 [puppet] - 10https://gerrit.wikimedia.org/r/336230 (owner: 10Giuseppe Lavagetto) [07:30:46] _joe_ yep yep [07:39:54] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:42:13] (03PS2) 10Giuseppe Lavagetto: joe: move hosts file for carbon to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/337526 (owner: 10Dzahn) [07:42:35] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] "Thanks a ton Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/337526 (owner: 10Dzahn) [07:46:24] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/oblivian/.hosts/install1002] [07:50:52] this one is fixed after a second puppet run --^ [07:51:24] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [07:51:31] (03CR) 10Giuseppe Lavagetto: [C: 031] "nice catch, thanks!" [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/337400 (https://phabricator.wikimedia.org/T147425) (owner: 10Ema) [08:23:07] !log restarting zookeeper on conf1002 to pick up OpenJDK update (restarts were stopped yesterday to further investigate gc behaviour) [08:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:04] Niharika: Dear anthropoid, the time has come. Please deploy Wikimania scholarships app deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170214T0830). [08:33:01] !log restarting zookeeper on conf1003 [08:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:24] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:37:00] (03PS5) 10Elukey: Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) [08:40:24] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[exim4-daemon-light],Package[exim4-config] [08:43:24] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [08:45:14] (03PS6) 10Elukey: Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) [08:45:43] !log installing vim security updates [08:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:47] elukey: what do you mean fixed by a second puppet run? [08:46:49] (03CR) 10Filippo Giunchedi: "I think the check can go, afaik there are no more precise dbs" [puppet] - 10https://gerrit.wikimedia.org/r/337204 (owner: 10Dzahn) [08:48:31] volans: silly thing, it was a 400 from the puppet master [08:49:00] (03PS1) 10Giuseppe Lavagetto: Revert "tests: Use sample data that doesn't match production names" [software/conftool] - 10https://gerrit.wikimedia.org/r/337547 [08:49:09] (03PS7) 10Elukey: Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) [08:49:36] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Revert "tests: Use sample data that doesn't match production names" [software/conftool] - 10https://gerrit.wikimedia.org/r/337547 (owner: 10Giuseppe Lavagetto) [08:49:38] (03CR) 10Muehlenhoff: "There's still labsdb1005 running precise" [puppet] - 10https://gerrit.wikimedia.org/r/337204 (owner: 10Dzahn) [08:49:40] (03CR) 10Jcrespo: [C: 04-1] "There is still 1 precise 
(labsdb1005), which will go away tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/337204 (owner: 10Dzahn) [08:50:04] (03PS8) 10Giuseppe Lavagetto: Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) [08:50:38] elukey: ok, I thought you meant that we needed 2 runs to set up something, and we shouldn't [08:50:40] (03CR) 10Marostegui: [C: 04-1] "We still have one precise host (labsdb1005 for example, which will be migrated tomorrow). So if this can wait a couple of days it might b" [puppet] - 10https://gerrit.wikimedia.org/r/337204 (owner: 10Dzahn) [08:50:56] jynus ^ you were faster! [08:54:20] (03CR) 10Gehel: [C: 031] "LGTM" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [08:54:46] thanks! I thought the labsdb migration happened already heh [09:00:01] no, tomorrow :) [09:01:56] Gehel, I am not sure if Go is well understood outside of France, Switzerland, Canada and Romania [09:04:25] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:04:50] jynus: I'm not sure I understand what you are referring to... [09:05:00] !log upgrading firejail on sca cluster [09:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:31] "it looks like most of the time (looking at logs over the past several days) the heap after GC is between 2 and 6 Go" [09:05:41] at T148478 [09:05:41] T148478: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478 [09:06:41] jynus: Oh... Gb would be a better acronym ? [09:06:56] I just noticed that because it is what I used while in Switzerland [09:07:07] but I have not seen it anywhere else [09:07:15] not a big deal [09:07:52] I am now learning that Go is not a universal acronym ! 
[09:08:16] which is exactly my intention, you may have not noticed that [09:08:31] probably is understood with context [09:08:42] but just a funny notice [09:08:51] * gehel is learning something new everyday! Thanks jynus! [09:09:31] probably GB or GiB is preferred, Gb being gigabit [09:10:31] hey, you have qwertz keyboards, life is already too hard for you :-) [09:11:01] :) [09:11:33] <_joe_> gehel: first time I read "Go" [09:11:35] * gehel is still trying to learn to use his alphagrip (http://www.alphagrips.com/) [09:11:42] <_joe_> I thought it was a typo :P [09:13:20] so I guess it is only in french that we say "octet" ? [09:13:39] (03PS2) 10Ema: VCL: Add support for WMF-Last-Access-Global analytics cookie [puppet] - 10https://gerrit.wikimedia.org/r/336790 (https://phabricator.wikimedia.org/T138027) [09:13:53] (03CR) 10Ema: [V: 032 C: 032] VCL: Add support for WMF-Last-Access-Global analytics cookie [puppet] - 10https://gerrit.wikimedia.org/r/336790 (https://phabricator.wikimedia.org/T138027) (owner: 10Ema) [09:13:54] <_joe_> instead of byte? [09:14:00] <_joe_> gehel: we just use the english word [09:14:01] ema: \o/ [09:14:03] thanks! [09:14:48] it is understood, only French are language zealots, but then say weekend :-) [09:14:50] we french speaking always try to have our own words :) [09:15:23] oh, I know, I had to learn a new language when I delivered a mysql course in french [09:15:53] <_joe_> jynus: here translating every foreign term to italian is frowned upon, because the fascist regime enforced that [09:16:00] <_joe_> so it's a fascist heritage of sorts [09:16:07] yeah, I see that [09:16:56] 06Operations, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review: Zookeeper heap usage patterns - https://phabricator.wikimedia.org/T157968#3024296 (10elukey) Moritz completed the restarts and the Heap usage pattern changed on all the nodes, so this is probably something to expect with the current settin... 
[09:17:44] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:17:44] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:17:51] uh [09:18:04] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:18:04] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:18:05] Detail: key not found: "top_domain" [09:18:15] :( [09:19:04] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:19:24] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:19:54] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:20:04] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:20:04] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:20:04] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:20:34] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:20:34] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:20:45] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:20:54] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:21:34] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:22:02] (03PS1) 10Ema: Analytics VCL: default to 'org' if top_domain is not set [puppet] - 10https://gerrit.wikimedia.org/r/337549 (https://phabricator.wikimedia.org/T138027) [09:22:18] elukey: ^ [09:22:54] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:23:42] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline. Marko, after you've explained the rationale for this change at ops/services syncup meeting it makes much more sense. Could you" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/336647 (owner: 10Mobrovac) [09:23:44] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:24:04] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:24:04] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:24:32] ema: looks good, maybe we can follow up later on with a test? [09:24:40] sorry, adding moar tests [09:24:42] if possible [09:25:04] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:27:50] elukey: yeah I've disabled puppet on cache hosts for now, will merge and test on a few hosts before re-enabling it [09:28:10] (03CR) 10Ema: [V: 032 C: 032] Analytics VCL: default to 'org' if top_domain is not set [puppet] - 10https://gerrit.wikimedia.org/r/337549 (https://phabricator.wikimedia.org/T138027) (owner: 10Ema) [09:29:19] ema: sure sure I wasn't doubting it, I meant Varnish tests to avoid regressions in the future (super ignorant, not sure if in this case it is good or not) [09:30:22] oh the vtc tests are there, this was just puppetfails :) [09:30:44] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:31:09] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3024335 (10fgiunchedi) >>! In T156023#3021239, @elukey wrote: > If the above counts are... [09:32:44] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [09:36:44] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:40:04] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:41:04] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:43:04] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:43:24] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [09:44:24] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:44:54] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:44:56] (03CR) 10Hashar: systemd: allow isequal to match programname in/rsyslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [09:45:04] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:45:43] (03PS2) 10Ema: MonitoringProtocol: do not crash with ValueError on unicode strings [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/337400 (https://phabricator.wikimedia.org/T147425) [09:47:04] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:47:40] (03PS2) 10Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 [09:48:04] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:48:34] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:48:34] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [09:48:54] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:49:27] (03CR) 10Ema: [V: 032 C: 032] MonitoringProtocol: do not crash with ValueError on unicode strings [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/337400 (https://phabricator.wikimedia.org/T147425) (owner: 10Ema) [09:49:34] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:50:54] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:51:44] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet 
is currently enabled, last run 15 seconds ago with 0 failures [09:52:04] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [09:53:04] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:54:04] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:58:39] (03CR) 10Volans: "Nice work! I'd like to play with it a bit, but for now see a bunch of comments inline." (0340 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [10:01:42] 06Operations, 06Parsing-Team, 13Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#3024378 (10ssastry) Okay .. I have this almost working now. But, I am finding that the old proxy settings I used for nodejs as well as phantom to contact services a... [10:02:43] ahahhah 40 comments by volans [10:03:00] :D :D :D [10:03:30] poor joe [10:03:43] elukey: :-P most of them are cosmetics, I miss in gerrit the feature to m*rk a comment as "blocker" that we had in our common previous job ;) [10:04:29] 06Operations, 06Parsing-Team, 13Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#2966544 (10MoritzMuehlenhoff) >>! In T156177#3024378, @ssastry wrote: > So, did something change on ruthenium wrt proxy settings? No, there shouldn't have been any... [10:07:53] subbu: you around? [10:08:23] hi yes. [10:08:31] i was responding on the ticket. [10:08:31] re:T156177 are you using the HTTP proxy by any chance? 
[10:08:32] T156177: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177 [10:09:06] yes, HTTP_PROXY_IP_AND_PORT=208.80.154.10:8080 [10:09:37] it was migrated from carbon to install1002, ahhh you're using hardcoded IPs... why? [10:11:22] i don't remember now. it is in /lib/systemd/system/parsoid-vd-client.service which comes from puppet ... [10:11:52] I'll check it, see also https://wikitech.wikimedia.org/wiki/HTTP_proxy [10:12:23] 06Operations, 06Parsing-Team, 13Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#3024424 (10ssastry) >>! In T156177#3024393, @MoritzMuehlenhoff wrote: >>>! In T156177#3024378, @ssastry wrote: >> So, did something change on ruthenium wrt proxy se... [10:12:54] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:13:24] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:14:36] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/5450/" [puppet] - 10https://gerrit.wikimedia.org/r/336804 (https://phabricator.wikimedia.org/T157022) (owner: 10Filippo Giunchedi) [10:15:03] subbu: would it be a problem if it was a domain instead of an IP? [10:15:11] I'd rather avoid to hardcode it! [10:15:24] volans: if you want to make a comment a blocker: vote Code-Review: -2 :} [10:15:31] (03PS1) 10Hashar: zuul: monitor Gearman queue growing out of control [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) [10:15:33] (03PS4) 10Filippo Giunchedi: diamond: require $handler to be defined [puppet] - 10https://gerrit.wikimedia.org/r/336804 (https://phabricator.wikimedia.org/T157022) [10:15:33] shouldn't be a problem. 
[10:15:47] volans: the vote stick between patchsets, though one can still remove view from the list of reviewers to clear the vote [10:16:50] volans: gehel: _joe_ : can one of you or any root head to cobalt and do a thread dump / java magic for Gerrit [10:16:56] it has a huge surge of load [10:16:59] hashar: I was referring to the possibility to m*rk a single comment as blocker, not for the general vote but to ease the reading of comments, I have dome 40 comments and like 3 of then must be changed, all the rest is mostly cosmetic and debatable [10:17:03] an issue that has been crippling us for a few weeks now [10:17:17] volans: at a previous job with reviewboard we'd prefix comments with ~ ~~ or ~~~ to indicate how cosmetic/picky the comment was, we could do sth like that [10:17:37] hashar: I'm heading to cobalt... [10:17:40] subbu: let me send a patch for that [10:17:43] or no prefix, for normal comments [10:17:54] k [10:17:58] godog: actually the updated review board I worked with was having this feature, just a checkbox :D [10:17:59] !log uploading pybal 1.13.5 to apt.w.o T147425 [10:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:05] T147425: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425 [10:18:18] gehel: Gerrit java process is acting strangely, sometime burst load out of control with lot of threads and we have no clue what is going on [10:18:20] volans: fancy! [10:18:28] gehel: whatever java trace you can take would surely help :} [10:19:23] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3024451 (10ArielGlenn) The 28G heap was for something that @demon was planning IIRC. Perhaps he can weigh in. 
[10:20:29] gehel: apparently you looked at Gerrit yesterday ( https://phabricator.wikimedia.org/T148478#3022797 ) and pointed at gc :/ [10:20:50] hashar: actually, I pointed at NOT gc [10:20:54] ah yeah [10:20:56] my bad [10:20:58] :) [10:21:38] it looks like there are a ton of threads doing nothing (we might have an oversized threadpool somewhere) [10:22:58] hashar: I have some thread dumps collected in cobalt.wikimedia.org:~/thread* [10:23:11] \O/ [10:23:12] hashar: I'm digging into them to see if I find something obvious [10:23:14] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3024468 (10hashar) Happening again right 39.1 G of VIRT and 27.4G of RES. `ps -eLf` shows that Gerrit has ~ 200 threads. [10:23:17] !log lvs200[456]: upgrade to jessie 8.7, pybal 1.13.5, reboot into kernel 4.4.2-3+wmf8 T155401 T147425 [10:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:23] T147425: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425 [10:23:23] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [10:23:34] volans, verified that using webproxy.eqiad.wmnet:8080 fixes the problem [10:23:51] gehel: that is nice :-} [10:23:59] (03PS1) 10Volans: Testreduce: use address instead of IP for web proxy [puppet] - 10https://gerrit.wikimedia.org/r/337553 (https://phabricator.wikimedia.org/T156177) [10:24:00] subbu: ^^^ [10:24:53] (03CR) 10Subramanya Sastry: [C: 031] Testreduce: use address instead of IP for web proxy [puppet] - 10https://gerrit.wikimedia.org/r/337553 (https://phabricator.wikimedia.org/T156177) (owner: 10Volans) [10:25:37] subbu: to be picky we should rename the env variable from HTTP_PROXY_IP_AND_PORT ;) [10:25:45] not being anymore an IP :D [10:25:46] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/337554 (https://phabricator.wikimedia.org/T158040) [10:26:12] (03CR) 10Volans: [C: 032] Testreduce: use address instead of IP for web proxy [puppet] - 10https://gerrit.wikimedia.org/r/337553 (https://phabricator.wikimedia.org/T156177) (owner: 10Volans) [10:26:29] volans, right .. will update it in follow up patches. [10:26:43] yeah, let's fix it for now [10:28:42] subbu: merged and puppt run on ruthenium [10:28:45] you can try again [10:28:50] will do. [10:28:54] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.399 second response time [10:30:31] 06Operations, 06Parsing-Team, 13Patch-For-Review: Visual-diff testreduce make ruthenium unresponsive - https://phabricator.wikimedia.org/T156177#3024483 (10Volans) For reference, it was the HTTP proxy that was hardcoded with carbon's IP in the systemd unit file. Modified with the name instead in the above pa... [10:30:50] volans, working now. [10:31:08] volans: there was a lot of I/O on md1 | /srv that is where the git repositories are [10:31:12] err wrong ping [10:31:22] so, those patches where you turned off these 2 services can be reverted. [10:31:24] subbu: great! Should I revert the change that was not starting the services? [10:31:27] :) [10:31:41] gehel: there was a lot of I/O on md1 | /srv that is where the git repositories are . 
I guess Gerrit is somehow busy with some huge git pack files [10:32:31] (03PS1) 10Raimond Spekking: Create Wikichanzo namespace for swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337555 (https://phabricator.wikimedia.org/T158041) [10:32:47] That's coherent with the thread dumps [10:32:48] (03PS1) 10Volans: Revert "Testreduce: allow to decide the state of the services" [puppet] - 10https://gerrit.wikimedia.org/r/337556 [10:33:22] (03PS2) 10Volans: Revert "Testreduce: allow to decide the state of the services" [puppet] - 10https://gerrit.wikimedia.org/r/337556 [10:33:54] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.765 second response time [10:35:49] (03PS3) 10Volans: Revert "Testreduce: allow to decide the state of the services" [puppet] - 10https://gerrit.wikimedia.org/r/337556 (https://phabricator.wikimedia.org/T156177) [10:35:59] * volans playing with gerrit UI :D [10:37:13] subbu: if you're ok I'll merge the revert and let puppet restart the 2 services [10:37:28] volans, give me a moment .. let me also fix he name of that setting to remove the _IP [10:37:28] will you be around to check if they work properly in the next ~30 min? [10:37:38] sure [10:37:45] I can change the puppet sude [10:37:47] *side [10:37:53] i'll update my code first, turn off the services on ruthenium, update the code, and then you can update puppet .. [10:37:56] i am around, yes. [10:38:45] great, let me know the final name for the HTTP_PROXY_IP_AND_PORT env variable [10:39:05] s/_IP// [10:40:02] gehel: for Jenkins we have JavaMelody to monitor the java app and it helped me a lot. Typically a nicer thread dump with the cpu/user time https://integration.wikimedia.org/ci/monitoring?part=threads :) [10:41:07] gehel: anyway can you report on the task with the thread dumps you have taken? and maybe Chad can look at it later today. Thanks for the traces! 
[10:41:14] (03PS1) 10Volans: Testreduce: renamed environmental variable [puppet] - 10https://gerrit.wikimedia.org/r/337557 (https://phabricator.wikimedia.org/T156177) [10:41:44] (03PS2) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337554 (https://phabricator.wikimedia.org/T158040) [10:41:44] hashar: I prefer raw thread dumps and some tool to analyze them (http://fastthread.io/, https://github.com/irockel/tda, ...) [10:41:53] (03CR) 10Subramanya Sastry: [C: 031] Testreduce: renamed environmental variable [puppet] - 10https://gerrit.wikimedia.org/r/337557 (https://phabricator.wikimedia.org/T156177) (owner: 10Volans) [10:42:26] volans, i now have updated code to use the new env variable. [10:42:34] so, you can merge that patch and i can test. [10:42:54] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [10:42:55] ok, just the renamed right? [10:42:58] yup. [10:43:01] (03PS2) 10Filippo Giunchedi: udp2log: mirror traffic from mwlog1001 to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/335625 (https://phabricator.wikimedia.org/T123728) [10:43:07] (03CR) 10Volans: [C: 032] Testreduce: renamed environmental variable [puppet] - 10https://gerrit.wikimedia.org/r/337557 (https://phabricator.wikimedia.org/T156177) (owner: 10Volans) [10:43:31] gehel: neat. Next time Jenkins explode I guess I can use those as well [10:44:03] hashar: first time that I'm using fastthread, but it looks nice! [10:44:25] subbu: done, ruthenium updated [10:45:08] works. [10:45:18] so you can merge the revert as well. 
[10:45:30] ok, doing it [10:46:11] (03PS4) 10Volans: Revert "Testreduce: allow to decide the state of the services" [puppet] - 10https://gerrit.wikimedia.org/r/337556 (https://phabricator.wikimedia.org/T156177) [10:47:01] (03PS2) 10Muehlenhoff: Only run the timesynd_ntp_status Icinga check every 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/337401 [10:48:09] (03PS3) 10Elukey: Revert "Revert "Add JVM Heap usage alarms for basic Hadoop daemons"" [puppet] - 10https://gerrit.wikimedia.org/r/335795 [10:48:14] !log installing tomcat security updates [10:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:43] * volans waiting for jenkins [10:51:07] !log lvs200[123]: upgrade to jessie 8.7, pybal 1.13.5, reboot into kernel 4.4.2-3+wmf8 T155401 T147425 [10:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:12] T147425: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425 [10:51:12] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [10:51:54] (03CR) 10Elukey: [C: 032] Revert "Revert "Add JVM Heap usage alarms for basic Hadoop daemons"" [puppet] - 10https://gerrit.wikimedia.org/r/335795 (owner: 10Elukey) [10:52:32] elukey: you steal my spot :-P [10:52:41] (03PS5) 10Volans: Revert "Testreduce: allow to decide the state of the services" [puppet] - 10https://gerrit.wikimedia.org/r/337556 (https://phabricator.wikimedia.org/T156177) [10:52:50] volans: sorryyyy [10:52:55] after 6 minutes waiting for jenkins to verify it :D [10:54:22] (03CR) 10Volans: [C: 032] Revert "Testreduce: allow to decide the state of the services" [puppet] - 10https://gerrit.wikimedia.org/r/337556 (https://phabricator.wikimedia.org/T156177) (owner: 10Volans) [10:55:34] subbu: puppet running on ruthenium with the revert [10:55:52] ok. thanks! 
[10:56:27] completed just now [10:56:45] but I didn't see starting the service, ah it's already running [10:56:53] (03CR) 10Filippo Giunchedi: "Different approach, mirror traffic mwlog1001 -> fluorine first and progressively send traffic to mwlog1001 instead." [puppet] - 10https://gerrit.wikimedia.org/r/335625 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [10:56:56] yes, i started it up earlier. [10:57:12] so, while i have your attention, i have another question for you .. i have parsoid service running on ruthenium on port 8142 .. but why does curl http://localhost:8142 lead to connection refused? [10:57:18] (03PS3) 10Muehlenhoff: Only run the timesynd_ntp_status Icinga check every 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/337401 [10:57:38] lemme look [10:58:08] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3024609 (10fgiunchedi) >>! In T123728#3021999, @fgiunchedi wrote: > Thanks a lot @bd808 for the explanation! I'll take a stab at the `GroupHandler` strategy first since that's the configuration I'd... [10:58:54] (03CR) 10Hashar: [C: 031] "Should be fine. I also checked that we have no Puppet classes having 'standard' somewhere in their name so there Puppet never made any rel" [puppet] - 10https://gerrit.wikimedia.org/r/337202 (owner: 10Dzahn) [10:59:05] subbu: works for me [10:59:22]
Welcome to the Parsoid web service.
[10:59:42] oh hmm .. ok. must be something in my setting. [11:00:38] alright, never mind .. will investigate later. [11:00:39] (03PS1) 10Filippo Giunchedi: Switch udp2log destination to mwlog1001 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337560 (https://phabricator.wikimedia.org/T123728) [11:00:54] 06Operations, 06Analytics-Kanban: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3024623 (10elukey) Just checked with Manuel if `thread_pool_size` was available for mysql 5.5 but it seems that it needs a proprietary extension to work. I executed `SET GLOBAL max_connections=... [11:01:23] subbu: ok, let me know if you need anything and feel free to close the Phab task once you feel confident it's working properly [11:02:30] (03CR) 10Muehlenhoff: [C: 032] Only run the timesynd_ntp_status Icinga check every 30 minutes [puppet] - 10https://gerrit.wikimedia.org/r/337401 (owner: 10Muehlenhoff) [11:02:38] volans, i had closed the ticket already. :) [11:02:49] right :D [11:03:19] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3024632 (10Gehel) Some thread dumps were collected: {F5628818} Note: [[ http://fastthread.io/ | fastThread.io ]] can help with the analysis. Preliminary findings: * most t... [11:03:55] hashar: I added a few notes... not really sure what the next step could be. [11:04:11] (03CR) 10Hashar: [C: 031] lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - 10https://gerrit.wikimedia.org/r/337201 (owner: 10Dzahn) [11:04:24] hashar: do we track the number of uploads and their size? [11:04:47] gehel: I don't think so [11:04:49] that being said, lunch time... [11:05:32] gehel: that can really be anything since Gerrit repos are exposed over https / ssh access. Thanks for the analysis of the thread dumps :} [11:05:51] I'm playing the dumb card of not knowing anything about gerrit internals... 
but do we have or it would be easy to add metrics for all the actions? [11:06:29] so to have a chance to correlate timewise the overloads with any specific action [11:06:52] volans: you read my mind! [11:07:16] gehel: you know... I provide magic :-P [11:07:29] (as you stated before) [11:08:15] volans: you need to add that line to your job description [11:08:24] rotfl [11:15:05] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3024661 (10hashar) Another potential issue is the memory usage [[ https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cobalt&var-network=eth0&panelId=14... [11:16:48] 06Operations, 13Patch-For-Review: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#3024664 (10MoritzMuehlenhoff) [11:16:51] 06Operations: Run Icinga check for systemd-timedated less often - https://phabricator.wikimedia.org/T157797#3024662 (10MoritzMuehlenhoff) 05Open>03Resolved check_timesyncd_ntp_status now runs only twice per hour [11:22:09] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3024670 (10Joe) >>! In T156023#3024335, @fgiunchedi wrote: >>>! In T156023#3021239, @elu... [11:22:50] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3024672 (10Joe) Also note that while for videoscalers and jobrunners it is advisable to... 
[11:28:35] !log performing schema change on all mariadb servers T150474 [11:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:40] T150474: Under high load, there is replication check pile-ups on coredbs, specially enwiki API servers - https://phabricator.wikimedia.org/T150474 [11:29:45] RECOVERY - Disk space on krypton is OK: DISK OK [11:30:55] !log manual fix up of exim spool permissions on krypton (used to run the heavy exim variant) [11:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:35] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.065 second response time [11:33:59] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3024690 (10hashar) `gerrit show-caches` takes roughly 45 seconds to complete: ``` Gerrit Code Review 2.13.4-13-gc0c5cc4742 now 11:26:37 UTC... [11:34:23] db1088 errors are mine, and they should be normal [11:35:42] (03PS1) 10Elukey: Change role to mw222[123] (appservers -> api_appservers) [puppet] - 10https://gerrit.wikimedia.org/r/337563 (https://phabricator.wikimedia.org/T156023) [11:36:00] they are only 40 errors, we should survive (and probably all from the job queue) [11:37:20] graphite1001 was me btw [11:38:09] I may be bringing down db1079? [11:38:47] ? [11:39:52] it is giving problems [11:40:55] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. Increasing 'thread_pool_max_threads' parameter can help in this situation. 
[11:41:32] yeah, it is the metadata locking [11:41:45] which kind of confirms the issue exists, and this will fix it [11:42:03] (03PS1) 10Filippo Giunchedi: install_server: fix graphite partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/337564 [11:42:48] channel:DBReplication won't be happy [11:43:39] (03PS2) 10Filippo Giunchedi: install_server: fix graphite partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/337564 [11:44:17] somebody from local is complaining "[WKLsXApAMEcAAF-fnygAAAEX] 2017-02-14 11:40:11: 종류 "DBQueryError"에서 심각한 오류" [11:44:35] PROBLEM - HHVM rendering on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:37] "Fatal error in type "DBqueryerror"" [11:44:41] (03CR) 10Elukey: [C: 032] "Looks good from https://puppet-compiler.wmflabs.org/5452/" [puppet] - 10https://gerrit.wikimedia.org/r/337563 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [11:45:20] revi, people should just retry [11:45:43] db1094 had a spike [11:45:46] I am checking it [11:45:49] (03PS3) 10Filippo Giunchedi: install_server: fix graphite partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/337564 [11:45:52] checking mw1296 [11:46:14] Same thing as db1079 [11:46:19] 06Operations: Upgrade firejail to 0.44 - https://phabricator.wikimedia.org/T149078#3024710 (10MoritzMuehlenhoff) 05Open>03Resolved This is done [11:46:26] RECOVERY - HHVM rendering on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 75302 bytes in 0.134 second response time [11:46:44] ok, then let's depool the largest ones one at a time [11:47:00] good, retry solved it [11:47:07] ty jynus [11:47:19] jynus: sure, I can help which one you want me to depool now? 
[11:50:18] (03PS4) 10Filippo Giunchedi: install_server: fix graphite partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/337564 [11:51:28] (03PS1) 10Elukey: Move mw222[123] from appservers to api_appservers (conftool) [puppet] - 10https://gerrit.wikimedia.org/r/337567 (https://phabricator.wikimedia.org/T156023) [11:54:48] (03CR) 10Elukey: [C: 032] Move mw222[123] from appservers to api_appservers (conftool) [puppet] - 10https://gerrit.wikimedia.org/r/337567 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [11:56:07] I am running it on all servers except the loaded ones [11:56:40] grep -v 'db107\|db108\|db109\|labsdb' [11:56:56] ok [11:57:22] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2221.codfw.wmnet [11:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:27] (03CR) 10Filippo Giunchedi: [C: 032] install_server: fix graphite partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/337564 (owner: 10Filippo Giunchedi) [11:58:35] (03PS5) 10Filippo Giunchedi: install_server: fix graphite partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/337564 [11:58:43] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] install_server: fix graphite partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/337564 (owner: 10Filippo Giunchedi) [12:00:36] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:02:22] gehel, hashar we could try lowering the heap size a bit [12:02:43] because i read somewhere that having a too big heap and a too small heap can actuall be bad. [12:08:05] (03PS1) 10Filippo Giunchedi: Switch xenon redis to mwlog1001.eqiad.wmnet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337568 (https://phabricator.wikimedia.org/T123728) [12:10:34] gehel hashar as this was i/o , would ssd make it perform better? 
[12:16:19] (03PS1) 10Filippo Giunchedi: performance: switch xenon apache backend to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) [12:18:05] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#3024779 (10Marostegui) 05Open>03Resolved a:03Marostegui I am going to close this for now as the short-term solution was to move search to ES and this hasn't... [12:20:37] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#3024783 (10Paladox) Elasticsearch is actually a long term fix. I think we have permanently moved to it. We were preparing to move to it before. :) [12:20:56] PROBLEM - DPKG on hafnium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:21:56] RECOVERY - DPKG on hafnium is OK: All packages OK [12:22:13] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#3024788 (10Marostegui) >>! In T156905#3024783, @Paladox wrote: > Elasticsearch is actually a long term fix. I think we have permanently moved to it. We were prepa... [12:23:10] 06Operations, 10DBA, 10Phabricator, 06Release-Engineering-Team: Phabricator master and slave crashed - https://phabricator.wikimedia.org/T156905#3024790 (10Paladox) Yep, @mmodell has improved searching by a lot :) [12:25:16] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3024792 (10Paladox) @hashar and @Gehel thanks for all your traces :). Looks like the i/o issue we had on friday. Though we also found gc was the problem on friday. Does gc ca... 
[12:28:36] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:37:46] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:40:16] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:43:17] <_joe_> gerrit is unavailable for me now [12:43:27] works for me [12:43:43] quite slow, but it works [12:43:48] it went a bit slow for me [12:43:50] bit it worked [12:43:58] there is an open ticket about that [12:48:58] (03PS1) 10Jcrespo: Depool db1051,66,80,74,77,56,81,70,82 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337573 (https://phabricator.wikimedia.org/T150474) [12:50:07] the alarms about analytics100[12] are mine, I added them today, still tuning them [12:52:08] (03CR) 10Marostegui: [C: 031] Depool db1051,66,80,74,77,56,81,70,82 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337573 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [12:52:49] (03CR) 10Jcrespo: [C: 032] Depool db1051,66,80,74,77,56,81,70,82 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337573 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [12:53:51] _joe_, votes for restarting it? [12:55:16] <_joe_> jynus: not necessary, works again [12:55:45] <_joe_> uhm, no, very slow again [12:55:46] <_joe_> +1 [12:55:55] copper? 
[12:56:30] (03Merged) 10jenkins-bot: Depool db1051,66,80,74,77,56,81,70,82 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337573 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [12:57:07] (03PS1) 10Elukey: Fix and tune the new Analytics Hadoop alarms [puppet] - 10https://gerrit.wikimedia.org/r/337574 (https://phabricator.wikimedia.org/T88640) [12:57:12] (03CR) 10jenkins-bot: Depool db1051,66,80,74,77,56,81,70,82 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337573 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [12:57:25] nope, cobalt [12:58:03] it is not swapping, though [12:59:08] !log reloading/restarting gerrint on cobalt, too slow [12:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:48] is it coming back? [13:00:37] <_joe_> it's up [13:00:41] <_joe_> still very slow I'd say [13:00:45] but equaly slow [13:00:46] yep [13:00:51] maybe it is git, then? [13:01:02] <_joe_> let's give it a sec now [13:01:06] yes [13:01:11] cold caches, etc [13:01:19] <_joe_> has anyone done anything with cobalt today [13:01:20] <_joe_> ? [13:02:06] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki] [13:02:26] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/limn-ee-data] [13:03:06] https://phabricator.wikimedia.org/T148478#3024468 [13:04:34] load is at 40 [13:04:46] and the restart didn't help [13:05:16] not me [13:05:26] <_joe_> it's all iowait [13:05:34] <_joe_> 18.0 wa [13:05:40] <_joe_> just ssh'd in [13:05:44] /srv/gerrit/jvmlogs [13:06:46] (03PS1) 10Elukey: Fix and tune the new Analytics Hadoop alarms [puppet] - 10https://gerrit.wikimedia.org/r/337575 (https://phabricator.wikimedia.org/T88640) [13:06:52] <_joe_> so someone is reading a lot? [13:07:46] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:08:11] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=cobalt&from=now-12h&to=now shows what Joe said [13:08:16] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:10:45] (03CR) 10Elukey: [C: 032] Fix and tune the new Analytics Hadoop alarms [puppet] - 10https://gerrit.wikimedia.org/r/337575 (https://phabricator.wikimedia.org/T88640) (owner: 10Elukey) [13:11:45] 06Operations, 06Release-Engineering-Team, 07HHVM, 07Wikimedia-Incident: 2016-10-17 API cluster overload - https://phabricator.wikimedia.org/T148652#3024909 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi @greg I think it is safe to close, we've mitigated the issue by having a separate cluster for asy... 
[13:11:58] just merged my patch, all good [13:15:24] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1051,66,80,74,77,56,81,70,82 (duration: 00m 40s) [13:15:24] (03PS1) 10Jcrespo: Revert "Depool db1051,66,80,74,77,56,81,70,82 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337576 [13:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:40] (03CR) 10Marostegui: [C: 031] Revert "Depool db1051,66,80,74,77,56,81,70,82 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337576 (owner: 10Jcrespo) [13:20:06] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:22:44] 06Operations, 10fundraising-tech-ops: disable/remove accounts for Brent Cohn from CPS data - https://phabricator.wikimedia.org/T158051#3024945 (10Jgreen) [13:24:55] (03PS3) 10Muehlenhoff: Add account validation script / cron job [puppet] - 10https://gerrit.wikimedia.org/r/337032 (https://phabricator.wikimedia.org/T142836) [13:25:47] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1051,66,80,74,77,56,81,70,82 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337576 (owner: 10Jcrespo) [13:27:13] (03Merged) 10jenkins-bot: Revert "Depool db1051,66,80,74,77,56,81,70,82 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337576 (owner: 10Jcrespo) [13:27:22] (03CR) 10jenkins-bot: Revert "Depool db1051,66,80,74,77,56,81,70,82 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337576 (owner: 10Jcrespo) [13:27:26] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:28:09] 06Operations, 10fundraising-tech-ops: disable/remove accounts for Brent Cohn from CPS data - https://phabricator.wikimedia.org/T158051#3024989 (10Jgreen) fundraising mysql privs, yubikey, and ssh public key have been revoked [13:28:55] !log jynus@tin Synchronized 
wmf-config/db-eqiad.php: Repool db1051,66,80,74,77,56,81,70,82 (duration: 00m 44s) [13:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:04] !log lvs400[34]: upgrade to jessie 8.7, pybal 1.13.5, reboot into kernel 4.4.2-3+wmf8 T155401 T147425 [13:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:11] T147425: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425 [13:32:12] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [13:33:45] (03CR) 10Jgreen: "Brent Cohn no longer needs access, see https://phabricator.wikimedia.org/T158051 for more info." [puppet] - 10https://gerrit.wikimedia.org/r/336994 (owner: 10Muehlenhoff) [13:41:26] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review, 15User-Elukey: Zookeeper heap usage patterns - https://phabricator.wikimedia.org/T157968#3025004 (10elukey) [13:44:02] (03PS1) 10Muehlenhoff: Record extented account expiry date for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/337578 [13:50:30] !log lvs400[12]: upgrade to jessie 8.7, pybal 1.13.5, reboot into kernel 4.4.2-3+wmf8 T155401 T147425 [13:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:36] T147425: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425 [13:50:36] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [13:53:16] (03PS1) 10Jcrespo: Depool db1055,72,83,56,84,76,78,87,71 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337579 (https://phabricator.wikimedia.org/T150474) [13:55:04] (03CR) 10Jcrespo: [C: 04-1] Depool db1055,72,83,56,84,76,78,87,71 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337579 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) 
[13:55:53] (03CR) 10Muehlenhoff: [C: 032] Record extented account expiry date for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/337578 (owner: 10Muehlenhoff) [13:56:16] (03PS2) 10Jcrespo: Depool db1055,72,83,56,84,76,78,87,71 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337579 (https://phabricator.wikimedia.org/T150474) [13:59:41] jouncebot: next [13:59:41] In 0 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170214T1400) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170214T1400). [14:00:04] addshore, Urbanecm, and dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:34] (03PS3) 10Hashar: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337554 (https://phabricator.wikimedia.org/T158040) (owner: 10Urbanecm) [14:00:37] (03PS1) 10Muehlenhoff: Remove access credentials for bcohn [puppet] - 10https://gerrit.wikimedia.org/r/337580 (https://phabricator.wikimedia.org/T158051) [14:00:38] o/ [14:00:51] o/ [14:00:52] +2ed the change for dcausse / ElasticSearch [14:01:11] ok [14:01:16] (03CR) 10Hashar: [C: 032] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337554 (https://phabricator.wikimedia.org/T158040) (owner: 10Urbanecm) [14:01:47] dcausse: I should have +2ed it earlier :D [14:01:54] o/ [14:01:57] :) [14:02:11] hashar: I'lll let you do everything? :) [14:02:26] addshore: I guess I can enable twocol thing for de :-} [14:02:32] cheers! 
:D [14:04:21] and my next challenge will be to find out why instances take longer and longer to start ( https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=11&fullscreen&from=now-30d&to=now ) :D [14:05:02] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337554 (https://phabricator.wikimedia.org/T158040) (owner: 10Urbanecm) [14:05:28] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for bcohn [puppet] - 10https://gerrit.wikimedia.org/r/337580 (https://phabricator.wikimedia.org/T158051) (owner: 10Muehlenhoff) [14:05:44] (03PS3) 10Hashar: Enable TwoColConflict on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332911 (https://phabricator.wikimedia.org/T155721) (owner: 10Addshore) [14:06:00] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332911 (https://phabricator.wikimedia.org/T155721) (owner: 10Addshore) [14:06:05] !log hashar@tin Synchronized wmf-config/throttle.php: Throttle rule for cswiki - T158040 (duration: 00m 40s) [14:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:10] T158040: Request for IP lift cap - 2017-02-15 - https://phabricator.wikimedia.org/T158040 [14:06:22] (03CR) 10Jcrespo: [C: 032] Depool db1055,72,83,56,84,76,78,87,71 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337579 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [14:06:33] dcausse: CI ETA ~ 5 minutes [14:06:44] ok [14:07:05] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337554 (https://phabricator.wikimedia.org/T158040) (owner: 10Urbanecm) [14:08:21] bah operations-mw-config-composer-hhvm-jessie lacks a composer cache [14:08:25] (03Merged) 10jenkins-bot: Enable TwoColConflict on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332911 (https://phabricator.wikimedia.org/T155721) (owner: 10Addshore) [14:08:49] (03Merged) 10jenkins-bot: 
Depool db1055,72,83,56,84,76,78,87,71 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337579 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [14:09:23] (03CR) 10jenkins-bot: Enable TwoColConflict on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/332911 (https://phabricator.wikimedia.org/T155721) (owner: 10Addshore) [14:09:24] addshore: waiting for canaries traffic [14:09:44] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable TwoColConflict on dewiki - T155721 (duration: 00m 40s) [14:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:49] T155721: Deploy TwoColConflict extension to meta, dewiki and one RTL-wiki - https://phabricator.wikimedia.org/T155721 [14:09:49] addshore: done :-) [14:10:04] jynus: you can do the change to Depool db1055,72,83,56,84,76,78,87,71 [14:10:53] sorry, I didn't know there was a swat now [14:11:07] I can wait [14:11:19] jynus: no worries, you can proceed :} [14:11:20] and nobody will touch that file [14:11:24] all the swat mediawiki-config have been deployed now [14:11:27] !log lvs300[34]: upgrade to jessie 8.7, pybal 1.13.5, reboot into kernel 4.4.2-3+wmf8 T155401 T147425 [14:11:31] yes, but I need to revert it in 5 minutes [14:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:32] T147425: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425 [14:11:32] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [14:11:36] lets pretend the depool was part of the swat :} [14:11:40] ok [14:11:56] thank you [14:12:09] jynus: mediawiki-config is all your. 
The last SWAT change is an extension update so that is disconnected (more or less) [14:12:40] 06Operations, 10fundraising-tech-ops, 13Patch-For-Review: disable/remove accounts for Brent Cohn from CPS data - https://phabricator.wikimedia.org/T158051#3024945 (10MoritzMuehlenhoff) His cluster access has been removed and he didn't have access to the "nda" LDAP group. [14:12:48] I will soon stop bothering you [14:13:00] will control pooling on etcd [14:13:41] (03PS1) 10Jcrespo: Revert "Depool db1055,72,83,56,84,76,78,87,71 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337582 [14:14:10] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1055,72,83,56,84,76,78,87,71 (duration: 00m 41s) [14:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:39] dcausse: it is on mwdebug1001 if there is something to test [14:15:33] hashar: not really, I just run mwscript eval.php to make sure the new config is here [14:15:48] but mwscript is not available on mwdebug* [14:15:49] dcausse: so I am going to sync it :) [14:15:51] sure [14:15:59] yeah [14:16:06] you gotta "scap pull" on terbium.eqiad.wmnet [14:16:06] I'll double check on terbium [14:16:10] and mwscript there [14:16:29] hashar: thanks! [14:16:33] sorry for dropping out there a bit! [14:16:46] !log hashar@tin Synchronized php-1.29.0-wmf.11/extensions/CirrusSearch/profiles/SimilarityProfiles.php: Explicitly use BM25 as default for wmf_defaults similarity profile (duration: 00m 47s) [14:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:19] dcausse: done [14:17:29] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1055,72,83,56,84,76,78,87,71 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337582 (owner: 10Jcrespo) [14:17:30] dcausse: sorry I have missed your "I'll double check on terbium" [14:17:48] hashar: sounds good on terbium, thanks! 
[14:17:56] !log European SWAT is complete [14:17:58] \O/ [14:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:58] (03Merged) 10jenkins-bot: Revert "Depool db1055,72,83,56,84,76,78,87,71 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337582 (owner: 10Jcrespo) [14:20:05] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1055,72,83,56,84,76,78,87,71 (duration: 00m 40s) [14:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:15] Sory that I'm very very late. Was my patch for EU SWAT deployed by somebody? [14:27:44] (03PS1) 10Elukey: Move mw224[45] from appservers to imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/337584 (https://phabricator.wikimedia.org/T156023) [14:28:45] (03PS1) 10Jcrespo: Depool db1073,89,90,91,92 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337586 (https://phabricator.wikimedia.org/T150474) [14:29:49] Urbanecm: the cswiki throttle rule? yes hashar did! :) [14:31:18] 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 13Patch-For-Review, and 2 others: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#2776652 (10Addshore) 05Open>03Resolved The extension is now in production and live on dewiki [14:31:52] (03CR) 10Jcrespo: [C: 032] Depool db1073,89,90,91,92 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337586 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [14:32:15] addshore, yes, thank you for your message and thank hashar for the deploy! 
[14:33:50] (03Merged) 10jenkins-bot: Depool db1073,89,90,91,92 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337586 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [14:34:31] (03PS1) 10Jcrespo: Revert "Depool db1073,89,90,91,92 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337587 [14:35:34] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1073,89,90,91,92 (duration: 00m 40s) [14:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:00] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1073,89,90,91,92 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337587 (owner: 10Jcrespo) [14:38:22] !log lvs300[12]: upgrade to jessie 8.7, pybal 1.13.5, reboot into kernel 4.4.2-3+wmf8 T155401 T147425 [14:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:28] T147425: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425 [14:38:28] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [14:40:14] (03Merged) 10jenkins-bot: Revert "Depool db1073,89,90,91,92 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337587 (owner: 10Jcrespo) [14:41:27] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1073,89,90,91,92 (duration: 00m 41s) [14:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:21] (03CR) 10Elukey: [C: 032] Move mw224[45] from appservers to imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/337584 (https://phabricator.wikimedia.org/T156023) (owner: 10Elukey) [14:47:43] (03PS1) 10Andrew Bogott: Horizon: Upgrade to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337591 [14:47:53] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3025240 
(10Marostegui) @Papaul you think we can do db2062 sometime this week? Thanks! [14:48:50] !log installing php security updates on einsteinium [14:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:11] PROBLEM - DPKG on mw2244 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:51:25] this is me :) [14:52:11] RECOVERY - DPKG on mw2244 is OK: All packages OK [14:53:39] (03CR) 10Dereckson: [C: 031] Create Wikichanzo namespace for swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337555 (https://phabricator.wikimedia.org/T158041) (owner: 10Raimond Spekking) [14:55:10] (03CR) 10Jcrespo: "This is unblocked T150474. I have converted the tables to InnoDB. There was indeed contention between readers and writers there- I didn't " [puppet] - 10https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [14:56:30] (03PS15) 10Marostegui: Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) [14:56:31] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:56:32] (03CR) 10Jcrespo: "I am ok to deploy this when you tell me, BTW." [puppet] - 10https://gerrit.wikimedia.org/r/327667 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [14:56:43] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3025254 (10Papaul) We can today. [14:57:40] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3025257 (10Marostegui) Awesome, I will depool it and get it ready to be moved Thanks! 
[14:59:26] (03PS1) 10Marostegui: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337592 (https://phabricator.wikimedia.org/T156478) [15:01:38] !log lvs10*: upgrade to pybal 1.13.5 T147425 [15:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:43] T147425: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425 [15:01:48] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337592 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [15:03:08] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337592 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [15:04:30] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2062 to change its rack - T156478 (duration: 00m 41s) [15:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:36] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [15:04:47] (03CR) 10Thcipriani: [C: 031] scap: move udp2log from fluorine to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/335624 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:04:58] (03PS1) 10Dereckson: Allow sysops to add/revoke account creator on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337594 (https://phabricator.wikimedia.org/T158062) [15:05:29] !log Shutdown mysql (and later the whole host) on db2062 for maintenance - T156478 [15:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:40] I'd lile to add this one to the SWAT but I see I'm slighlty late ^ [15:06:46] marostegui: you still are using tin? [15:06:57] Dereckson: I am done with my change [15:07:31] Okay, thanks, I'm going to SWAT it quickly in this case. 
[15:07:37] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337594 (https://phabricator.wikimedia.org/T158062) (owner: 10Dereckson) [15:07:52] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3025330 (10Ottomata) > What about Spark streaming? Too much? It would be a good occasion to use it... Yeah it would! But I wouldn't feel good about maintaining this in the Analytics Cluster. We'd... [15:08:35] 06Operations, 10Pybal, 13Patch-For-Review: pybal error "exceptions.ValueError: Value of arguments is not a string or stringlist" - https://phabricator.wikimedia.org/T147425#3025332 (10ema) 05Open>03Resolved a:03ema I've upgraded all LVSs to pybal 1.13.5 which includes a fix for this: https://gerrit.wik... [15:09:33] (03Merged) 10jenkins-bot: Allow sysops to add/revoke account creator on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337594 (https://phabricator.wikimedia.org/T158062) (owner: 10Dereckson) [15:09:50] Live on mwdebug1002. [15:09:53] (03CR) 1020after4: [C: 031] scap: move udp2log from fluorine to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/335624 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:10:16] Works. [15:10:26] (03CR) 10Marostegui: [C: 032] Reporting tests with the private data script [puppet] - 10https://gerrit.wikimedia.org/r/328352 (https://phabricator.wikimedia.org/T153680) (owner: 10Marostegui) [15:11:06] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Allow sysops to add/revoke account creator on it.wikiversity (T158062) (duration: 00m 41s) [15:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:09] T158062: Allow administrators to add/remove users to/from account creators on it.wikiversity - https://phabricator.wikimedia.org/T158062 [15:12:02] Done. 
[15:12:48] (03PS3) 10Filippo Giunchedi: udp2log: mirror traffic from mwlog1001 to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/335625 (https://phabricator.wikimedia.org/T123728) [15:12:50] (03PS2) 10Filippo Giunchedi: scap: move udp2log from fluorine to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/335624 (https://phabricator.wikimedia.org/T123728) [15:12:52] (03PS2) 10Filippo Giunchedi: performance: switch xenon apache backend to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/337569 (https://phabricator.wikimedia.org/T123728) [15:13:10] (03CR) 10Ottomata: [C: 032] Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [15:13:25] (03CR) 10Ottomata: [V: 032 C: 032] Add default JAVA_OPTS to the zookeeper server class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/337413 (https://phabricator.wikimedia.org/T157968) (owner: 10Elukey) [15:16:26] 06Operations, 10DBA, 13Patch-For-Review, 05Prometheus-metrics-monitoring: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#3025383 (10jcrespo) 05Open>03Resolved Performance schema is finally on all servers, and so does the sys schema. I will close this becaus... [15:17:52] 06Operations, 10DBA, 13Patch-For-Review, 05Prometheus-metrics-monitoring: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#3025389 (10jcrespo) [15:17:55] 06Operations, 10DBA, 05Prometheus-metrics-monitoring: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#3025386 (10jcrespo) 05stalled>03Resolved a:03jcrespo We will finaly go, because of privacy concerns, for a private prometheus instance for the... 
[15:17:56] (03PS4) 10Filippo Giunchedi: udp2log: mirror traffic from mwlog1001 to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/335625 (https://phabricator.wikimedia.org/T123728) [15:17:59] 06Operations, 10DBA, 10Traffic, 06WMF-Legal, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#3025390 (10jcrespo) [15:21:31] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] udp2log: mirror traffic from mwlog1001 to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/335625 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:23:32] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3025409 (10jcrespo) [15:24:34] RECOVERY - puppet last run on graphite1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:26:30] (03PS3) 10Filippo Giunchedi: scap: move udp2log from fluorine to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/335624 (https://phabricator.wikimedia.org/T123728) [15:30:14] (03CR) 10Giuseppe Lavagetto: "thanks for the review!" (0331 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [15:30:32] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3025417 (10jcrespo) a:05jcrespo>03None No crashes in the last 4 months, it seems? 
[15:30:47] (03PS9) 10Giuseppe Lavagetto: Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) [15:31:04] (03CR) 10Filippo Giunchedi: [C: 032] scap: move udp2log from fluorine to mwlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/335624 (https://phabricator.wikimedia.org/T123728) (owner: 10Filippo Giunchedi) [15:31:19] (03CR) 10jerkins-bot: [V: 04-1] Add schema support [software/conftool] - 10https://gerrit.wikimedia.org/r/288881 (https://phabricator.wikimedia.org/T155823) (owner: 10Giuseppe Lavagetto) [15:33:01] 06Operations, 10DBA: Populate the wikishared db on all dbstores - https://phabricator.wikimedia.org/T126252#3025427 (10jcrespo) p:05Normal>03Low a:05jcrespo>03None [15:33:04] <_joe_> that was expected ^^ [15:33:37] thcipriani|afk twentyafterfour merged the scap udp2log change, thanks! should be a noop as far as fluorine is concerned, i.e. logs will still end up on fluorine via mwlog1001 [15:34:48] 06Operations, 13Patch-For-Review: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#3025434 (10fgiunchedi) [15:35:03] 06Operations, 10DBA: Populate the wikishared db on all dbstores - https://phabricator.wikimedia.org/T126252#3025435 (10jcrespo) [15:35:39] godog: cool, will keep an eye out next time we're scappin' :) [15:36:02] thcipriani|afk: awesome, let me know if you have any trouble with mwlog1001 too [15:36:06] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3025436 (10jcrespo) a:05jcrespo>03None Let's have a look soon at the decom plan and paste it here when we are happy. [15:36:12] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3025438 (10Anomie) >>! In T119366#3023410, @Tgr wrote: > `#time` and co. 
are used on many pages and usually they do not require cache invalidation. For exa... [15:36:55] 06Operations, 10DBA: Puppetize grants for mysql analytics servers - https://phabricator.wikimedia.org/T114476#3025440 (10jcrespo) a:05jcrespo>03None [15:38:00] (03CR) 10Giuseppe Lavagetto: [C: 031] systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [15:39:50] 06Operations, 10DBA: Decommission db1015, db1035 and db1044 - https://phabricator.wikimedia.org/T148078#3025442 (10jcrespo) p:05Normal>03High a:05jcrespo>03None High because they will complain of lack of space soon. [15:40:11] 06Operations, 10ops-codfw, 10DBA: Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March - https://phabricator.wikimedia.org/T130702#3025445 (10Marostegui) >>! In T130702#3025417, @jcrespo wrote: > No crashes in the last 4 months, it seems? Indeed, the uptimes are looking very... [15:40:53] 06Operations, 10DBA: Decommission db1015, db1035 and db1044 - https://phabricator.wikimedia.org/T148078#3025447 (10jcrespo) [15:40:59] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3025448 (10jcrespo) [15:41:37] 06Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#2714228 (10jcrespo) [15:42:09] 06Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3025456 (10Marostegui) db1044 is db1095's master, so we need to look for another candidate within the shard (and change it to ROW) and make sure it has the same content as db1044 otherwise we will br... 
[15:43:18] 06Operations, 10DBA, 07Chinese-Sites: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#3025459 (10jcrespo) a:05jcrespo>03None [15:45:52] (03PS4) 10Filippo Giunchedi: prometheus: add v6 reverse records [dns] - 10https://gerrit.wikimedia.org/r/337422 (https://phabricator.wikimedia.org/T154504) (owner: 10Dzahn) [15:48:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:48:33] PROBLEM - Text HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:48:45] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Change db2062 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337597 (https://phabricator.wikimedia.org/T156478) [15:49:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:50:02] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication problems - production and labs (tracking) - https://phabricator.wikimedia.org/T50930#3025478 (10jcrespo) [15:51:21] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, I've moved the reverses in the right origin zone" [dns] - 10https://gerrit.wikimedia.org/r/337422 (https://phabricator.wikimedia.org/T154504) (owner: 10Dzahn) [15:51:30] (03CR) 10Nuria: Analytics VCL: default to 'org' if top_domain is not set (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337549 (https://phabricator.wikimedia.org/T138027) (owner: 10Ema) [15:51:44] 06Operations, 10DBA: Puppetize grants for mysql backups on dbstore hosts - https://phabricator.wikimedia.org/T111929#3025484 (10jcrespo) a:05jcrespo>03None [15:53:04] (03PS2) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [15:53:42] (03CR) 10Marostegui: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337597 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) 
[15:53:44] (03CR) 10Hashar: "A real world example https://gerrit.wikimedia.org/r/#/c/337404/1..2/modules/jenkins/manifests/init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [15:54:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:54:14] (03PS1) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [15:55:27] hashar: any idea why is this taking so long? https://gerrit.wikimedia.org/r/#/c/337597/1 (I had to add the bot myself) [15:55:48] (03PS1) 10Papaul: DNS: change db2062 production DNS Bug:T156478 [dns] - 10https://gerrit.wikimedia.org/r/337600 [15:56:27] marostegui: oh that is interesting [15:56:28] (03PS2) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [15:56:30] (03CR) 10jerkins-bot: [V: 04-1] jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 (owner: 10Hashar) [15:56:34] marostegui: looks like no jobs get triggered [15:56:58] 06Operations, 10DBA, 10Monitoring: Display lag on grafana (prometheus) and dbtree from pt-heartbeat instead (or in addition) of Seconds_Behind_Master - https://phabricator.wikimedia.org/T141968#3025531 (10jcrespo) a:05jcrespo>03None [15:57:07] :/ [15:57:08] marostegui: err no sorry. They are triggered and about to complete after ~8 minutes [15:57:21] oh wow, 8 minutes :) [15:58:03] the job https://integration.wikimedia.org/ci/job/operations-mw-config-composer-hhvm-jessie/964/console took roughly 3 minutes to do the git fetch bah [15:58:27] is this all fallout from gerrit being slow? 
[15:58:28] (03CR) 10Marostegui: [C: 031] DNS: change db2062 production DNS Bug:T156478 [dns] - 10https://gerrit.wikimedia.org/r/337600 (owner: 10Papaul) [15:58:48] hashar: yeah, I can see it now there, it finally went thru after 9 minutes [15:58:48] chasemp: CI has its own git-daemon s [15:58:52] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Change db2062 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337597 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [15:59:00] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3025537 (10Papaul) @RobH we about to move db2062 in row D rack D6 to row B rack 5. I will like for you please if you have time to make some changes on both switches . o... [15:59:02] 06Operations, 10DBA: Adapt wmf-mariadb10 package for jessie or puppetize differently its service to adapt it to systemd - https://phabricator.wikimedia.org/T116903#3025538 (10jcrespo) 05stalled>03Open a:05jcrespo>03None This is now possible for 10.1, and I have packages for stretch that do that. We hav... [15:59:23] yeah, not saying I can explain it :) just asking [15:59:34] haha [15:59:48] (03PS2) 10Andrew Bogott: Horizon: Upgrade to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337591 [15:59:53] 06Operations, 10DBA: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3025543 (10jcrespo) a:05jcrespo>03None [16:00:04] andrewbogott: Dear anthropoid, the time has come. Please deploy Horizon upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170214T1600). 
[16:00:23] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Change db2062 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337597 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [16:01:13] 06Operations: Update kernel on db1011 - https://phabricator.wikimedia.org/T113720#3025546 (10jcrespo) 05Open>03Resolved db1011 was upgraded and restart the other day: ``` uname -a Linux db1011 3.13.0-107-generic ``` [16:01:33] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Change db2062 IP after its move to another rack - T156478 (duration: 00m 40s) [16:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:38] T156478: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478 [16:01:57] Feb 14 15:53:48 contint2001 git-daemon[116387]: Request upload-pack for '/operations/mediawiki-config' [16:01:57] Feb 14 15:56:34 contint2001 git-daemon[954]: [116387] Disconnected [16:02:05] marostegui: ^^that is all I got on the git-daemon side [16:02:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Change db2062 IP after its move to another rack - T156478 (duration: 00m 39s) [16:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:36] 06Operations, 10DBA, 13Patch-For-Review: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#3025553 (10jcrespo) 05Open>03Resolved This was done on dbstore2 manifest. We have not seen reasons to do it on the oth... 
[16:02:37] (03CR) 10jenkins-bot: [V: 04-1] openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush) [16:02:39] 15:53:48 > git -c core.askpass=true fetch --tags --progress git://contint2001.wikimedia.org/operations/mediawiki-config +refs/heads/*:refs/remotes/origin/* [16:02:39] 15:56:44 [16:02:48] that is the jenkins console output [16:02:51] who knows what happened over those 3 minutes :/ [16:02:55] hashar: it is also weird that I had to add the bot manually, no? [16:03:03] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:03:18] marostegui: when you send a patch there is no one that commented it. [16:03:38] 06Operations, 06Collaboration-Team-Triage, 10DBA: Move echo tables from local wiki databases onto extension1 cluster for mediawikiwiki, metawiki, and officewiki - https://phabricator.wikimedia.org/T119154#3025558 (10jcrespo) a:05jcrespo>03None [16:03:39] once CI has finished running the build, it will add a comment and some Verified vote, which will add it as a reviewer [16:03:52] Ah, I didn't know that process :) [16:04:04] It is normally so fast that I never got to that situation that I had to add the bot myself [16:04:21] most repos (if not all) require a Verified +2 vote for a change to be submittable [16:04:35] and the gentleman agreement is that solely jenkins-bot votes verified+2 [16:04:59] though a human technically has the rights to set the V+2 vote.
Typically when CI has some issue, a test is flappy or whatever else reason [16:05:20] so the best is to head to https://integration.wikimedia.org/zuul/ [16:05:24] find your change # [16:05:35] eventually use the Filter [___________] field to filter by the repo name [16:05:53] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.20 seconds [16:05:55] (03PS1) 10Zhuyifei1999: Install libicu52 on python & python2 base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337603 (https://phabricator.wikimedia.org/T157744) [16:06:05] (03CR) 10Andrew Bogott: [C: 032] Horizon: Upgrade to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337591 (owner: 10Andrew Bogott) [16:06:18] I will check that slave lag [16:06:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:06:36] !log Updated wikimania app to 5c44d06 Removed stale translations [16:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:56] (03CR) 10jenkins-bot: db-codfw.php: Depool db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337592 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [16:07:11] (03CR) 10jenkins-bot: Allow sysops to add/revoke account creator on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337594 (https://phabricator.wikimedia.org/T158062) (owner: 10Dereckson) [16:07:20] (03CR) 10jenkins-bot: Revert "Depool db1073,89,90,91,92 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337587 (owner: 10Jcrespo) [16:08:16] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Change db2062 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337597 (https://phabricator.wikimedia.org/T156478) (owner: 10Marostegui) [16:08:50] (03CR) 10jenkins-bot: Revert "Depool db1055,72,83,56,84,76,78,87,71 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337582 (owner: 
10Jcrespo) [16:09:24] (03CR) 10jenkins-bot: Depool db1055,72,83,56,84,76,78,87,71 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337579 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [16:09:34] RECOVERY - Text HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:10:14] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:10:43] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python3-novaclient] [16:11:53] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.43 seconds [16:12:44] hashar: sorry, I was checking that slave that lagged. Thanks for the info, I didn't know about that zuul status page :) [16:12:47] (03CR) 10jenkins-bot: Depool db1073,89,90,91,92 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337586 (https://phabricator.wikimedia.org/T150474) (owner: 10Jcrespo) [16:13:10] marostegui: that shows the internal state of the various Zuul pipelines/queues [16:13:39] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: [epic] System level upgrade for cirrus / elasticsearch - https://phabricator.wikimedia.org/T151324#3025606 (10dcausse) [16:13:53] PROBLEM - DPKG on californium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:14:41] 06Operations, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3025623 (10Fjalapeno) The Reading team is currently investigating whether it needs to procure additional hardware to support the #mobile-content-ser... [16:14:55] hashar: so if it is being slow as a few minutes ago, where would you see it? basically the queue growing? 
[16:15:23] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3025627 (10Fjalapeno) [16:15:53] !log dist-upgrade californium (as part of the liberty->mitaka upgrade) [16:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:50] (03PS1) 10Filippo Giunchedi: deployment::server: enable jessie-wikimedia/experimental [puppet] - 10https://gerrit.wikimedia.org/r/337605 (https://phabricator.wikimedia.org/T140927) [16:16:53] RECOVERY - DPKG on californium is OK: All packages OK [16:17:43] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:18:18] (03CR) 10Chad: [C: 031] deployment::server: enable jessie-wikimedia/experimental [puppet] - 10https://gerrit.wikimedia.org/r/337605 (https://phabricator.wikimedia.org/T140927) (owner: 10Filippo Giunchedi) [16:18:23] !log rebooting californium [16:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:58] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3025650 (10Halfak) Hey! Thanks for the ping. It turns out that the referenced CPU/memory issues weren't due to ORES ov... [16:25:33] PROBLEM - Text HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:25:57] high text everywhere? [16:26:14] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:26:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:26:42] 503 spike or staying high? 
[16:27:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:27:12] there was a spike before [16:29:43] stream.wm.o? [16:29:55] or piwiki? [16:30:42] I think it is stream [16:31:03] two high spikes, thing look ok now [16:31:45] it might be piwik, checking [16:31:57] (03PS2) 10Marostegui: DNS: change db2062 production DNS Bug:T156478 [dns] - 10https://gerrit.wikimedia.org/r/337600 (owner: 10Papaul) [16:32:13] is it misc ? [16:32:34] there may be some piwik errors [16:32:41] but most come from the stream [16:32:50] (03CR) 10Marostegui: [C: 032] DNS: change db2062 production DNS Bug:T156478 [dns] - 10https://gerrit.wikimedia.org/r/337600 (owner: 10Papaul) [16:33:22] it seems text to me right? [16:33:30] cc: ema [16:34:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:34:53] it is not varnish [16:35:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:36:01] well I can see two peaks in https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?from=now-3h&to=now [16:36:31] yes, I checked the errors at those 2 spikes on oxygen [16:37:03] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:37:04] ah ok I might have misinterpreted the "it is not varnish" [16:37:06] okok sorry [16:37:27] well, I mean it literally [16:37:34] it is not varnish [16:38:04] they are 503 coming from just a particular url of the stream [16:38:27] sure sure now I got what you mean [16:38:33] PROBLEM - Text HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:40:13] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3025706 (10Gehel) @Paladox looking at the GC logs from last 
Friday (Feb 10) I don't really see a GC issue, but more a memory allocation issue. That is, I see the multiple ful... [16:40:19] jynus: "url of the stream" you mean stream.wm.o? From what I see the 503 peaks seems for text, not misc [16:41:00] Andrew is checking btw [16:41:26] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3025707 (10Paladox) >>! In T148478#3025706, @Gehel wrote: > @Paladox looking at the GC logs from last Friday (Feb 10) I don't really see a GC issue, but more a memory allocat... [16:41:42] I am checking https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?var-site=All&var-cache_type=text&var-status_type=5&from=now-3h&to=now btw [16:42:15] the stream.wm.org 502s look like a RCStream problem [16:42:24] will investigate [16:42:41] yes, most errors come from stream.wikimedia.org [16:43:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:43:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:43:42] I do not see high mediawiki errors [16:45:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:46:16] (03CR) 10DCausse: [C: 031] Configure cirrus per-index settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336936 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [16:46:25] jynus: you sure? 
tailing 5xx.json, i see plenty for stream, but also lots of others too [16:46:30] lots of upload.wm.org [16:46:44] oh, I filter upload by default [16:47:00] because most of the 5XX are in reality 404 badly returned [16:47:12] grep '16:38' /srv/log/webrequest/5xx.json | grep -v upload | less [16:47:18] grep '16:41' /srv/log/webrequest/5xx.json | grep -v upload | less [16:47:28] also, https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?var-site=All&var-cache_type=text&var-status_type=5&from=now-3h&to=now&panelId=2&fullscreen [16:47:30] is text only [16:47:33] stream is misc [16:47:38] but yeah, i'll look into the stream ones [16:47:42] upload is not text either [16:47:46] oh true. [16:49:16] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3025720 (10demon) >>! In T148478#3025707, @Paladox wrote: > We didn't have this problem with gerrit's old server ytterbium. I believe ytterbium had an ssd, but I'm not really... [16:50:03] bblack: , yt? [16:50:33] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:48] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3025744 (10Paladox) Oh, I wonder why ytterbium didn't have this problem. Could this be a bug in gerrit that is not really noticeable because upstream use google servers so th... [16:53:44] ottomata: would it be possible to have something like text_5xx.json, upload_5xx.json, etc.. rather than a single one? [16:53:48] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3025747 (10Gehel) >>! In T148478#3025707, @Paladox wrote: > Oh, would using ssd alleviate these problems as ssd are faster then hdd? Faster / better IO never hurts! Especial... 
[16:53:56] on oxygen I mean [16:54:03] it would simplify a lot investigations [16:54:16] you check Varnish return codes, then you peek the right fiel [16:54:18] *file [16:54:34] elukey: it is possible, sure [16:54:40] ottomata: ? [16:54:59] 06Operations, 10MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#3025762 (10jcrespo) One possibility would be and edge case of how the latest changes to the load balancer works- so that queries get stuck rather than timing out e... [16:55:08] 06Operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#3025766 (10Jgreen) [16:55:52] bblack: jynus noticed a bunch of 5xx erros (502s) coming from rcstream [16:56:02] all have to do with /socket.io, so its rcstream (not eventstreams) [16:56:12] nginx rcstream_error.log on rcs1001 has a lot of [16:56:27] upstream prematurely closed connection while reading response header from upstream [16:56:30] jynus: I can see some DBReplication spikes in https://grafana.wikimedia.org/dashboard/db/production-logging?panelId=13&fullscreen&from=now-3h&to=now-3m, may be related ? [16:56:40] ok [16:56:46] not quite as worried about that [16:56:50] what about the text 5xx rate? [16:57:02] (03PS4) 10RobH: new shell user Nithum Thain [puppet] - 10https://gerrit.wikimedia.org/r/337438 [16:57:03] jynus: says there aren't many? dunno [16:57:06] (also incidentally, noticed that our POST traffic to test went up by like a factor of 4 starting 6 days ago...) [16:57:13] s/test/text/ [16:57:29] bblack: we were wondering what was failing text :( [16:58:08] that just peaked a lot https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?from=now-3h&to=now [16:58:14] 06Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3025774 (10jcrespo) Have a look at my shard planning, I think I had some options there. 
[16:58:45] ottomata, replication errors are not errors [16:58:46] (03CR) 10RobH: [C: 032] new shell user Nithum Thain [puppet] - 10https://gerrit.wikimedia.org/r/337438 (owner: 10RobH) [16:58:56] they do not return 5xx [16:59:40] yeah okok it was me btw, I wanted to ask because I saw the peaks [17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170214T1700). [17:00:04] hashar, RainbowSparkles, and brion: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:12] you will also see that the spikes do not fit the 503 spikes [17:00:19] \o/ [17:00:50] 06Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3025793 (10Marostegui) Yeah, you placed db1064 there as a master for it for s4. We need to make sure they have the same data as otherwise ROW will not like that [17:00:55] jynus: yep sure, I just asked to double check [17:01:22] no recent code deploys? [17:03:11] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3025803 (10RobH) @ellery/@nithum: I screwed up and I just realized that while @Nithum's access is live, since he is a contractor I should have put in a... 
[17:03:27] it's pretty easy to grep usually, though [17:03:29] 06Operations, 10ops-codfw, 06DC-Ops, 10hardware-requests: decom install2001 - https://phabricator.wikimedia.org/T157840#3025804 (10Papaul) [17:04:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:04:36] the text 503 spikes seem to be api queries, but that's not very telling [17:04:58] even a general random small rate of 503s from a failing server or whatever would be likely to show up as mostly API queries [17:05:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:06:03] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:06:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:06:25] bblack: To answer your question: no, no recent deploys (at least of MW) [17:06:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:07:17] 1489 "?action=opensearch [17:07:18] 5080 "?action=query [17:07:32] ^ biggest 503s are api.php with those, still drilling down [17:08:05] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3025831 (10Ottomata) > term udplog { + 1 > Remove IPs the term analytics-publicIP-v4: +1 > Review the IPs in term ssh Don't know anything about this, but also not sure why we have speci... 
[17:08:16] 2 "?action=opensearch&search=right_to_come_out_with_a_strong_selling_time_to_share_some_of_it_i_come_back_like_an_academic_newsline_thirsting_for_knowledge_that_might_help_you_now_this_article_shows_the_speed_limit_score_downstream_now_is_very_very&limit=15" [17:09:28] it might be useful to look at fatal logs / logstash [17:09:33] PROBLEM - Text HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:10:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:10:29] 06Operations, 10fundraising-tech-ops, 13Patch-For-Review: disable/remove accounts for Brent Cohn from CPS data - https://phabricator.wikimedia.org/T158051#3025837 (10Jgreen) p:05Triage>03High [17:10:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [17:10:50] bblack: I was trying but didn't find anything useful [17:10:52] :( [17:11:48] 06Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3025843 (10jcrespo) s4 is... special in that regard. [17:11:48] well, 503s like this generally mean MW is failing in some way [17:12:00] perhaps due to bad external input, but still [17:12:08] that opensearch query I pasted above sounds suspicious [17:12:51] 06Operations, 10DBA: Decommission db1015, db1035, db1044 and db1038 - https://phabricator.wikimedia.org/T148078#3025853 (10Marostegui) worst case scenario we can move db1044's data to db1064 :-) [17:13:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:13:22] some kind of opensearch abuse? [17:13:26] 06Operations, 10Gerrit, 13Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3025855 (10demon) >>! 
In T148478#3025747, @Gehel wrote: > Faster / better IO never hurts! Especially since we seem to have [[ https://grafana.wikimedia.org/dashboard/file/ser... [17:14:37] the strange opensearch queries are coming from Akamai IPs... [17:15:43] but they're only a small fraction of the 503 spike traffic [17:15:49] dcausse, ebernhardson: opensearch (see above) looks related to Cirrus (but I don't know that side all that much) [17:17:28] (03Draft1) 10Paladox: Gerrit: Lower heap size to 10gb [puppet] - 10https://gerrit.wikimedia.org/r/337609 (https://phabricator.wikimedia.org/T148478) [17:17:32] (03PS2) 10Paladox: Gerrit: Lower heap size to 10gb [puppet] - 10https://gerrit.wikimedia.org/r/337609 (https://phabricator.wikimedia.org/T148478) [17:18:33] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:19:13] (03CR) 10Chad: [C: 04-1] "No. I'd rather move in smaller increments *towards* 10GB (or lower), not all at once." [puppet] - 10https://gerrit.wikimedia.org/r/337609 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [17:19:14] gehel: opensearch is comp suggest, I don't see anything wrong on the backend :/ [17:19:24] (03CR) 10Chad: [C: 04-1] "Don't worry, I'll take care of this" [puppet] - 10https://gerrit.wikimedia.org/r/337609 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [17:19:47] gehel: opensearch == prefix search / completion suggester (dependeing on namespace) [17:19:47] (03CR) 10Paladox: "> Don't worry, I'll take care of this" [puppet] - 10https://gerrit.wikimedia.org/r/337609 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [17:19:50] * ebernhardson scrolls up to find problem [17:19:51] I see a few scattered errors to comp suggest in logstash, but not nearly enough to cause the spike you're seeing [17:20:04] I have to check if we don't return a 5xx if the query is too long [17:20:08] they're several small spikes coming and going [17:20:09] (03Abandoned) 
10Hashar: Gerrit: Lower heap size to 10gb [puppet] - 10https://gerrit.wikimedia.org/r/337609 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [17:20:23] 16:50 is one timestamp I'm focusing on right now [17:20:30] ok [17:20:49] there were ~13K 503's during that minute [17:20:50] https://logstash.wikimedia.org/goto/9e3877ae3178ed7324c84fece1f37e9a [17:21:49] top 3 when paring it down to the first query arg were: [17:21:52] 517 "?action=parse [17:21:52] 1385 "?action=opensearch [17:21:52] 4813 "?action=query [17:22:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite2001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:22:07] it's unusual that the pattern isn't more-concentrated in one type of query [17:22:33] RECOVERY - Esams HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:22:33] RECOVERY - Text HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:22:42] action=query scares me more than action=opensearch [17:22:52] action=query can hit a *lot* of nasty codepaths, easily [17:22:56] RainbowSprinkles: yea the logs cirrus made don't look to indicate a particular problem there. [17:23:03] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:23:07] for action=query would need to also bucket by list= and/or generator= [17:23:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:24:03] only ~2K of them had a generator argument at all, and <1K had a list argument [17:24:07] is it possible to have the list of backend that return 5xx, could be some overloaded appservers? [17:24:16] yeah looking into it [17:27:43] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:29:03] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3025899 (10GWicke) @Halfak, the referenced issues were very definitely ORES using almost all memory. Things have certain... [17:29:44] bblack: I am grepping api_appservers eqiad with salt to count 503s [17:30:20] nothing really comes up [17:30:29] I'm starting to think it's actually a bad varnish backend [17:30:38] it seems that way, but I can't find the real cause underneath [17:30:46] (grep proxy-server/503 /var/log/apache2/other_vhosts_access.log | wc -l fyi) [17:34:16] !log bblack@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1067.eqiad.wmnet [17:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:43] the bulk of the 503 spikes seem to be generated by varnish being unable to connect (or similar) to backend service, but almost always from cp1067 and not other machines [17:34:55] I can't yet find a cause why, but, depooled it [17:37:22] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3026017 (10RStallman-legalteam) Hi RobH, I can give you the info. The current Memo of Understanding between Nithum and WMF is through June 15, 2017. We u... [17:38:12] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3026019 (10RobH) @RStallman-legalteam: Do we have who at WMF should get the email notification of the account expiry? (Typically that person's direct ma... [17:38:32] bblack: can I ask how did you find the cp1067 (possible) cause? 
[17:38:39] (when you have finished) [17:41:04] well 16:50 is one of the spike times on cache_text 5xx graph [17:41:08] bblack@oxygen:~$ grep '16:50' /srv/log/webrequest/5xx.json |grep -w 503|jq .x_cache|less [17:41:27] almost all the x_cache lines there start with "cp1067 int" [17:41:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3026031 (10RobH) 05Resolved>03Open I'm reopening this task, because it turns out I made a mistake when we first setup Sam Tarlings account. During the proce... [17:41:49] (meaning cp1067 generated the 503 internally, as a backend cache, meaning it had issues getting a response out of the applayer) [17:42:08] if it had been randomly distributed around, I'd say applayer. but when it's all coming from one cache box, probably the cache box :) [17:43:38] bblack: sure from oxygen, didn't think about that, thanks :) [17:44:06] yeah analytics has way better versions of that stuff [17:44:10] 06Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3026040 (10Tgr) >>! In T119366#3025438, @Anomie wrote: > Well, ideally it would limit cache expiry to "however much time is left until the comparison chang... [17:44:19] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3026041 (10mobrovac) >>! In T146664#3025899, @GWicke wrote: > @Halfak, the referenced issues were very definitely ORES u... [17:44:26] bblack: not in real time though :( [17:44:31] but oxygen is the simplistic text logs exported for un-analytical opsen to use our primitive unix CLI tools on :) [17:45:00] one thing that it might be good to do is separate 5xx.json in per cache files (text_5xx.json, etc..) 
[17:45:09] it should be easy enough [17:45:15] or just include the cache cluster as one of the json fields [17:45:33] in the long run that's going to get less-meaningful, I wouldn't invest a lot of time in that direction [17:45:54] (cache clusters will merge up more, there probably won't be distinct "text" and "misc") [17:45:59] yes, but not sure what is easier since IIRC it is using kafkatee/kafkacat straight from kafka [17:46:27] plus some "cheatsheet" maybe? In a wiki page or something [17:46:32] :) [17:46:45] could write down some standard tips I guess? I donno [17:47:03] every analysis is unique. look at the json fields for patterns, use grep/sort/uniq/jq to tease things out [17:47:30] (03Draft1) 10Paladox: Gerrit: Converts Velocity template into soy template [puppet] - 10https://gerrit.wikimedia.org/r/337613 [17:47:31] sometimes it's the client ip field that gives up answers, or the x_cache field, or uri_path or uri_query, etc.. [17:47:32] (03PS2) 10Paladox: Gerrit: Converts ChangeSubject Velocity template into soy template [puppet] - 10https://gerrit.wikimedia.org/r/337613 (https://phabricator.wikimedia.org/T158008) [17:47:41] oh yes but there might be some preliminary tests that are common (like grepping a specific field | sort | uniq -c etc..) [17:48:00] (03CR) 10Paladox: "I doint know how to do this conversion for the its/ .vm templates." 
[puppet] - 10https://gerrit.wikimedia.org/r/337613 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [17:48:03] anyhow, only ideas :) [17:48:11] sure, yeah [17:48:19] I'll work on splitting 5xx.json if there is consensus [17:48:20] this was one of my earlier fruitless CLI lines for instance: [17:48:21] grep '16:50' 5xx.json |grep -v upload|grep -w 503|grep -v '10\.68\.'|grep '/w/api\.php'|jq .uri_query|cut -d'&' -f1 |sort|uniq -c|sort -n [17:48:22] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3026047 (10ellery) That would be @DarTar . [17:49:01] jq .uri_query| for example is a nice one, writing it down :) [17:49:06] elukey: really, like within 6 months there won't be a cache_misc vs cache_text distinction to rely on [17:49:33] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:43] there will just be "upload" (which will have like 2 distinct legit HTTP hostnames in it) and "text" (pretty much everything else) [17:49:50] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3026051 (10RobH) a:05RobH>03MoritzMuehlenhoff I chatted with @RStallman-legalteam about this, and Sam Tarling's NDA doesn't actually have an expiry date. I'... [17:50:29] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3026053 (10RobH) Awesome, I'll correct the access now (it won't affect actual usage, as that is already live.) 
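Editor's note: the "cheatsheet" idea discussed above (grep a minute, keep the 503s, pull one JSON field, then sort | uniq -c) could be sketched in Python roughly as follows. This is only an illustration, not a tool anyone in the channel used. The x_cache and uri_query field names come from the discussion; the "http_status" and "dt" field names, the comma-separated layout of x_cache after its first entry, and the helper names are assumptions made for the sketch.

```python
import json
from collections import Counter


def x_cache_origin(record):
    """First x_cache entry, e.g. 'cp1067 int' (comma-separated layout is assumed)."""
    return str(record.get("x_cache", "")).split(",")[0].strip()


def first_query_arg(record):
    """First query argument, like `jq .uri_query | cut -d'&' -f1`."""
    return str(record.get("uri_query", "")).split("&")[0]


def tally(lines, key_func, status="503", minute=None):
    """Count records per key, like `sort | uniq -c` on one extracted field.

    `lines` is any iterable of JSON strings (one record per line).
    "http_status" and "dt" are assumed field names for status and timestamp.
    """
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip blank or truncated lines
        if not isinstance(record, dict):
            continue
        if str(record.get("http_status")) != status:
            continue
        if minute is not None and minute not in str(record.get("dt", "")):
            continue
        counts[key_func(record)] += 1
    return counts


def report(counts):
    """Print counts in ascending order, as `sort -n` would."""
    for key, n in sorted(counts.items(), key=lambda item: item[1]):
        print(f"{n:7d} {key}")
```

Hypothetical usage: open the 5xx.json file, call tally(fh, x_cache_origin, minute="16:50") and pass the result to report(). If one cache host dominates the output, that points at the cache box rather than the applayer, which is the reasoning bblack describes above. Swapping in first_query_arg gives the per-action breakdown instead.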
[17:50:50] bblack: okok, will drop the idea :) [17:51:09] but MAYBE we could think about splitting in other creative ways [17:51:11] not sure [17:51:18] ok I am stopping here [17:51:21] :) [17:51:34] <_joe_> elukey: what about we feed that stream to spark and do live queries? [17:51:39] <_joe_> that would be interesting :P [17:51:49] <_joe_> it's your realm too [17:52:46] _joe_ I discussed it with the team a while ago, I can dig a bit more! [17:53:19] with Kafka 0.10 we'll have timestamps and it could be even cooler to use spark [17:53:20] elukey: or throw some metrics at prometheus with tags that allows to have aggregated and detailed data :-P [17:53:23] * volans hides [17:53:43] WHATEVER WORKS :D [17:54:43] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:57:12] (03CR) 10Chad: [C: 031] "We can do this one for starters. The recursive file{} stanza in jetty.pp should already install this for us :)" [puppet] - 10https://gerrit.wikimedia.org/r/337613 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [17:58:23] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3026075 (10Papaul) p:05Triage>03Normal [17:58:41] 06Operations, 10ops-codfw, 10hardware-requests: decommission ms2001 & ms2002 - https://phabricator.wikimedia.org/T157991#3022661 (10Papaul) Disk wipe in progress [17:58:41] (03CR) 10Paladox: ":), thanks." [puppet] - 10https://gerrit.wikimedia.org/r/337613 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170214T1800). 
[18:00:48] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3026081 (10RobH) >>! In T156478#3025537, @Papaul wrote: > @RobH we about to move db2062 in row D rack D6 to row B rack 5. I will like for you please if you have time to m... [18:01:49] (03CR) 10Chad: [C: 031] "No change in compiler, as expected: https://puppet-compiler.wmflabs.org/5456/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/337613 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [18:02:01] (03PS1) 10RobH: adding info to nithum's shell account [puppet] - 10https://gerrit.wikimedia.org/r/337614 [18:03:00] (03CR) 10JustBerry: [C: 031] "Looks consistent and per Phab discussion. No build broke in the process." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337603 (https://phabricator.wikimedia.org/T157744) (owner: 10Zhuyifei1999) [18:03:22] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: Change rack for servers in s1 in codfw - https://phabricator.wikimedia.org/T156478#3026085 (10Marostegui) db2062 has been moved to B5 DNS updated db-eqiad,codfw files updated mysql started replication started and server catching up tendril updated Than... [18:04:12] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3026086 (10Halfak) @mobrovac, I'd not been notified about #Operations coming to a conclusion about moving ORES out of SC... [18:05:44] (03CR) 10RobH: [C: 032] adding info to nithum's shell account [puppet] - 10https://gerrit.wikimedia.org/r/337614 (owner: 10RobH) [18:06:45] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 13Patch-For-Review: Cluster Access for Nithum Thain - https://phabricator.wikimedia.org/T157724#3026094 (10RobH) 05Open>03Resolved Fixed to add in expiry info and contact, so resolving this task again. 
[18:07:21] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3026099 (10Halfak) @gwicke, I remember looking into that event and determining that it was not ORES using all of the mem... [18:11:03] !log Purged https://he.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-he.svg with purgeList.php [18:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:33] (03CR) 10Mobrovac: [C: 031] systemd: allow isequal to match programname in/rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/337411 (owner: 10Hashar) [18:11:57] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3026131 (10Halfak) So, rather than continue this debate on an unrelated phab task, here's what I propose. 1. @Gwicke,... [18:12:00] 06Operations, 10fundraising-tech-ops, 13Patch-For-Review: disable/remove accounts for Brent Cohn from CPS data - https://phabricator.wikimedia.org/T158051#3026133 (10Jgreen) fundraising shell access revoked civi client cert revoked [18:13:04] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3026136 (10Halfak) [18:13:16] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#2667780 (10Halfak) [18:16:39] Did puppet swat get canceled during the error flurry? 
[18:17:27] !log arlolra@tin Started deploy [parsoid/deploy@1bfb86b]: Updating Parsoid to 79ccfb93 [18:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:34] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:19:47] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3026188 (10Fjalapeno) @Halfak ahh… thanks for the link! Forgot about that ticket when looking around. [18:26:03] (03CR) 10Paladox: [C: 031] Install libicu52 on python & python2 base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337603 (https://phabricator.wikimedia.org/T157744) (owner: 10Zhuyifei1999) [18:27:26] !log arlolra@tin Finished deploy [parsoid/deploy@1bfb86b]: Updating Parsoid to 79ccfb93 (duration: 09m 58s) [18:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:06] !log starting branch cut for 1.29.0-wmf.12 [18:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:42] brion: I think so yeah (re: puppet swat), I wasn't around tho [18:32:33] thcipriani: LMK when ok to merge https://gerrit.wikimedia.org/r/#/c/337605/ btw [18:34:12] godog: will do. How long are you around today? This might be best to merge today after train but before evening SWAT. If not today we can find another window. I don't want to disrupt deployment windows is my only concern with it, really. 
[18:35:20] Awww, let's live dangerously [18:35:41] (03PS2) 10Muehlenhoff: Don't enable the Diamond ntpd collector if systemd-timesyncd is used [puppet] - 10https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) [18:35:41] :D [18:36:23] thcipriani: likely another hour or so, worst case I'll do it tomorrow EU morning [18:36:53] today's valentine day gift was receiving 2 cubic meter pallet shipment from ireland with our stuff [18:37:27] godog: yeah EU morning might work best then :) [18:37:36] thcipriani: ok! [18:37:58] * thcipriani feels that new git excitement [18:38:19] (03PS1) 10Andrew Bogott: Horizon: add explicit "!" policies for unsupport services. [puppet] - 10https://gerrit.wikimedia.org/r/337618 [18:39:03] godog: thank you for the git packaging and backporting — will make a big difference for train for sure :) [18:39:25] !log Updated Parsoid to 79ccfb93 (T58381, T108216) [18:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:30] T108216: Indented pre blocks just broken - https://phabricator.wikimedia.org/T108216 [18:39:30] T58381: Change DOM rendering of to be or similar? - https://phabricator.wikimedia.org/T58381 [18:39:46] thcipriani: np, looking forward to moving to stretch so we won't even need it anymore [18:40:51] yup: future is bright :) [18:40:56] (03PS5) 10Filippo Giunchedi: prometheus: add v6 reverse records [dns] - 10https://gerrit.wikimedia.org/r/337422 (https://phabricator.wikimedia.org/T154504) (owner: 10Dzahn) [18:41:39] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add v6 reverse records [dns] - 10https://gerrit.wikimedia.org/r/337422 (https://phabricator.wikimedia.org/T154504) (owner: 10Dzahn) [18:41:43] godog: no worries, just have a small tweak to video scaler config. Whenever it goes out is fine, no rush :) [18:44:17] (03PS2) 10Andrew Bogott: Horizon: add explicit "!" policies for unsupport services. 
[puppet] - 10https://gerrit.wikimedia.org/r/337618 [18:44:19] (03PS1) 10Andrew Bogott: Horizon: Backport a newton fix to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337619 [18:45:20] (03CR) 10jerkins-bot: [V: 04-1] Horizon: Backport a newton fix to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337619 (owner: 10Andrew Bogott) [18:45:24] (03CR) 10jerkins-bot: [V: 04-1] Horizon: add explicit "!" policies for unsupport services. [puppet] - 10https://gerrit.wikimedia.org/r/337618 (owner: 10Andrew Bogott) [18:46:27] (03CR) 10RobH: [C: 031] "\o/ standardization \o/" [puppet] - 10https://gerrit.wikimedia.org/r/337378 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [18:48:28] (03PS2) 10Andrew Bogott: Horizon: Backport a newton fix to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337619 [18:48:30] (03PS3) 10Andrew Bogott: Horizon: add explicit "!" policies for unsupport services. [puppet] - 10https://gerrit.wikimedia.org/r/337618 [18:51:06] (03PS2) 10Filippo Giunchedi: Enable Prometheus exporter on restbase1007 (canary) [puppet] - 10https://gerrit.wikimedia.org/r/337493 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [18:54:39] (03CR) 10Filippo Giunchedi: "LGTM, minor nit on the description" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [18:55:33] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 263 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:55:44] (03PS3) 10Andrew Bogott: Horizon: Backport a newton fix to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337619 [18:55:46] (03PS4) 10Andrew Bogott: Horizon: add explicit "!" policies for unsupport services. 
[puppet] - 10https://gerrit.wikimedia.org/r/337618 [18:56:20] (03CR) 10Filippo Giunchedi: [C: 032] Enable Prometheus exporter on restbase1007 (canary) [puppet] - 10https://gerrit.wikimedia.org/r/337493 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [18:59:37] (03PS4) 10Andrew Bogott: Horizon: Backport a newton fix to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337619 (https://phabricator.wikimedia.org/T158099) [18:59:39] (03PS5) 10Andrew Bogott: Horizon: add explicit "!" policies for unsupport services. [puppet] - 10https://gerrit.wikimedia.org/r/337618 (https://phabricator.wikimedia.org/T158099) [19:00:13] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:00:34] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 263 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:01:19] (03CR) 10Andrew Bogott: [C: 032] Horizon: Backport a newton fix to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337619 (https://phabricator.wikimedia.org/T158099) (owner: 10Andrew Bogott) [19:01:26] (03PS5) 10Andrew Bogott: Horizon: Backport a newton fix to Mitaka [puppet] - 10https://gerrit.wikimedia.org/r/337619 (https://phabricator.wikimedia.org/T158099) [19:01:33] (03CR) 10Andrew Bogott: [C: 032] Horizon: add explicit "!" policies for unsupport services. [puppet] - 10https://gerrit.wikimedia.org/r/337618 (https://phabricator.wikimedia.org/T158099) (owner: 10Andrew Bogott) [19:01:40] (03PS6) 10Andrew Bogott: Horizon: add explicit "!" policies for unsupport services. 
[puppet] - 10https://gerrit.wikimedia.org/r/337618 (https://phabricator.wikimedia.org/T158099) [19:03:18] (03CR) 10Filippo Giunchedi: [C: 04-1] "The idea LGTM, though it seems simpler to reuse the existing ssl nginx vhost" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/336852 (owner: 10Giuseppe Lavagetto) [19:04:35] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3026509 (10EBernhardson) [19:07:34] (03PS3) 10Filippo Giunchedi: prometheus: temporary rsync server for metrics migration [puppet] - 10https://gerrit.wikimedia.org/r/330348 (https://phabricator.wikimedia.org/T148408) [19:10:00] (03PS6) 10Krinkle: Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 [19:10:02] (03CR) 10Krinkle: "Rebased." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 (owner: 10Krinkle) [19:10:11] (03CR) 10Krinkle: [C: 032] Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 (owner: 10Krinkle) [19:11:20] Verifying no-op on mwdebug1001 [19:11:34] (03Merged) 10jenkins-bot: Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 (owner: 10Krinkle) [19:11:43] (03CR) 10jenkins-bot: Don't use computed dblist in production (nowikidatadescriptiontaglines) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/334462 (owner: 10Krinkle) [19:12:58] Krinkle: one sec. Lemme clean up wikiversions.json [19:13:06] Okay, haven't pulled yet [19:13:28] Just looking to make sure x-debug-wikimedia logging still works in logstash. Noticed some issues with it over the weekend. 
[19:13:48] Krinkle: Should be good now [19:13:56] okay [19:14:35] thcipriani: btw, scap pull on mwd1 says "cannot delete non-empty dir" for older wmf.4/5 dirs [19:15:14] thcipriani: yuvi does the labs container rebuilding right? [19:15:14] yeah, I was just looking at that. Rsync permission things. [19:16:14] Hm.. we no longer have hte channel grouping on the x-debug dashboard? (The one that says 1000 DEBUG, 100 INFO, 0 ERROR for example) [19:16:32] https://logstash.wikimedia.org/app/kibana#/dashboard/x-debug?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now))&_a=(filters:!((%27$state%27:(store:appState),meta:(alias:!n,disabled:!f,index:%27logstash-*%27,key:_type,negate:!f,value:mediawiki),query:(match:(_type:(query:mediawiki))))),options:(darkTheme:!f),panels:!((col [19:16:32] :1,id:Events-Over-Time,panelIndex:14,row:1,size_x:12,size_y:2,type:visualization),(col:1,columns:!(level,channel,host,wiki,message),id:MediaWiki-Events-List,panelIndex:15,row:3,size_x:12,size_y:11,sort:!(%27@timestamp%27,desc),type:search)),query:(query_string:(analyze_wildcard:!t,query:%27reqId:%22WKNXQQpAIHsAAD2U1pAAAAAB%22%27)),title:x-debug,uiState:()) [19:16:33] JustBerry: not 100% sure, but that seems right to me. You should check in #wikimedia-labs [19:16:42] ugh.. sorry long url [19:16:44] another win for kibana urls [19:16:58] there is a share button that creats a tiny url :P [19:16:58] thcipriani: stale ;p [19:17:27] yeah, but this is the "short" url created by the debug extension itself [19:17:34] That should be shorter possibly as well. [19:17:44] for some definition of "short" [19:17:47] ;) [19:18:37] actual short url: https://logstash.wikimedia.org/goto/ed3717f897e2a5a1cc94be1c64decb9c [19:18:43] 06Operations, 10ops-codfw: wtp2019 has faulty memory - https://phabricator.wikimedia.org/T146009#2647413 (10RobH) wtp2019 / wmf6180 is covered under Dell warranty until 2018-01-19. So if it is faulty, we can get a replacement under warranty. 
[19:19:16] 06Operations, 10ops-codfw: wtp2019 has faulty memory - https://phabricator.wikimedia.org/T146009#3026598 (10RobH) a:03Papaul [19:19:51] the ui to get a short url is weird. [19:20:27] share this link: [insanely long link] [two-arrows-touching] [clipboard] [19:20:54] (03CR) 10Muehlenhoff: "https://gerrit.wikimedia.org/r/#/c/336420/ (not yet merged) adds generic support for the experimental, let's better base on that than addi" [puppet] - 10https://gerrit.wikimedia.org/r/337605 (https://phabricator.wikimedia.org/T140927) (owner: 10Filippo Giunchedi) [19:20:57] Krinkle: when we upgraded logstash we had to make the whole dashboard description be part of the link generated by the browser plugin. I went with a pretty basic dashboard [19:21:14] bd808: yeah, it no longer takes query paramaters? [19:21:16] we'll be updating kibana to a new version soon-ish, no clue if that will be better or not [19:21:35] Krinkle: nope. they decided not to port that functionality [19:22:07] !log krinkle@tin Synchronized dblists/: I67194fceffd3f61 (duration: 01m 37s) [19:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:39] no the new version of kibana doesn't look any better in that respect, main change is instead of [two-arrow-touching] icon you get a blue link that says 'short url' [19:23:47] !log krinkle@tin Synchronized docroot/noc/conf: I67194fceffd3f61 (duration: 00m 48s) [19:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:57] Done [19:24:36] (03PS2) 10Dzahn: partman: delete raid1-lvm-ext4 recipe [puppet] - 10https://gerrit.wikimedia.org/r/337532 (https://phabricator.wikimedia.org/T156955) [19:24:59] robh: ^ one to delete that we don't use.. 
in response to your ticket [19:25:48] * Krinkle misses the "Create task" button in Phabricator [19:27:11] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#2965447 (10RobH) Unfortunately, I don't know of a way to enable hyperthreading without rebooting the server into the bios. This means downtime for each one of these hosts. [19:27:40] Krinkle: someone pointed out you now have to add it to favorites. It works well enough, but perhaps we need some default favorites everyone gets [19:27:56] I have it in the star already (it was by default for me) [19:28:03] but it's an extra click and under an icon with no label [19:28:07] indeed [19:28:11] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:28:24] (03CR) 10Dzahn: [C: 04-1] "of course it can wait, i had no expectation that it gets merged right away. i just knew at some point we'd want to remove it and wanted to" [puppet] - 10https://gerrit.wikimedia.org/r/337204 (owner: 10Dzahn) [19:28:35] for some reason my favorites defaulted to empty ... oh well [19:42:45] (03PS3) 10Dzahn: Gerrit: Converts ChangeSubject Velocity template into soy template [puppet] - 10https://gerrit.wikimedia.org/r/337613 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [19:44:58] mutante: That one can go out whenever. Won't need a restart since it's unused until we upgrade ^ [19:49:36] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#2965447 (10Dzahn) I found this somewhere that they say is for disabling HT "`racadm set BIOS.ProcSettings.LogicalProc Disabled`" but have not tried it. [http://www.gooksu.com/2015/04/27/racadm-quick-... 
[19:50:05] (03CR) 10Dzahn: [C: 032] Gerrit: Converts ChangeSubject Velocity template into soy template [puppet] - 10https://gerrit.wikimedia.org/r/337613 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [19:50:17] mutante ^^ thanks :) [19:50:22] RainbowSprinkles: yep, just talked to paladox :) [19:50:40] i was wondering for a second it doesnt need to be added in .pp, but i see [19:51:14] mutante: Yeah, that entire etc/ directory is copied in recursively from puppet [19:51:59] *nod* [19:52:33] (03PS6) 10Ottomata: Drop wdqs_extract partitions older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) (owner: 10Nschaaf) [19:55:22] (03CR) 10Ottomata: [C: 032] Drop wdqs_extract partitions older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335437 (https://phabricator.wikimedia.org/T146915) (owner: 10Nschaaf) [20:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170214T2000). Please do the needful. [20:00:14] * thcipriani does [20:03:54] !log thcipriani@tin Started scap: testwiki to php-1.29.0-wmf.12 and rebuild l10n cache [20:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:13] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#3026797 (10yuvipanda) We announced a while ago we're gonna do this on the 15th. [20:05:55] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#3026807 (10jcrespo) Let's use T157358 for this. Postgres is a different beast. 
[20:06:58] (03PS4) 10Dzahn: lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - 10https://gerrit.wikimedia.org/r/337201 [20:09:56] (03PS1) 10Yuvipanda: python: Install icu dev files [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337634 (https://phabricator.wikimedia.org/T157744) [20:22:43] !log Update site statistics for pam.wikipedia (T158110, now 454 images) [20:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:47] T158110: Update statistics count on pam.wikipedia - https://phabricator.wikimedia.org/T158110 [20:23:30] (03PS1) 10Jdlrobson: Cleanup popups beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337636 [20:26:38] (03PS3) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [20:26:59] (03PS4) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [20:28:03] (03PS21) 10Rush: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [20:31:32] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3026919 (10mobrovac) >>! In T146664#3026086, @Halfak wrote: > @mobrovac, I'd not been notified about #Operations coming... 
[20:32:20] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush) [20:33:41] !log otto@tin Started deploy [analytics/refinery@67c3924]: Deploying refinery with update to drop hourly partitions script [20:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:08] (03PS5) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [20:36:06] !log otto@tin Finished deploy [analytics/refinery@67c3924]: Deploying refinery with update to drop hourly partitions script (duration: 02m 25s) [20:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:14] (03PS1) 10Ottomata: Use --partition-type hive for refinery-drop-wdqs-extract-partitions job [puppet] - 10https://gerrit.wikimedia.org/r/337639 (https://phabricator.wikimedia.org/T146915) [20:36:27] (03CR) 10Hashar: zuul: monitor Gearman queue growing out of control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [20:36:41] (03PS2) 10Hashar: zuul: monitor Gearman queue growing out of control [puppet] - 10https://gerrit.wikimedia.org/r/337552 (https://phabricator.wikimedia.org/T70113) [20:38:19] 06Operations, 10Mobile-Content-Service, 10ORES, 10Revision-Scoring-As-A-Service-Backlog, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3027004 (10Halfak) @mobrovac, let me try again. Who from #operations did you talk to? Was that agreement public? Can... 
[20:38:20] (03PS1) 10Ottomata: Update published-datasets-readme.txt [puppet] - 10https://gerrit.wikimedia.org/r/337640 [20:39:13] (03PS3) 10Hashar: jenkins: migrate to systemd [puppet] - 10https://gerrit.wikimedia.org/r/337404 [20:39:15] (03CR) 10Ottomata: [C: 032] Use --partition-type hive for refinery-drop-wdqs-extract-partitions job [puppet] - 10https://gerrit.wikimedia.org/r/337639 (https://phabricator.wikimedia.org/T146915) (owner: 10Ottomata) [20:40:47] (03PS2) 10Ottomata: Update published-datasets-readme.txt [puppet] - 10https://gerrit.wikimedia.org/r/337640 [20:40:51] (03CR) 10Ottomata: [V: 032 C: 032] Update published-datasets-readme.txt [puppet] - 10https://gerrit.wikimedia.org/r/337640 (owner: 10Ottomata) [20:41:31] (03PS6) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [20:42:41] PROBLEM - salt-minion processes on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:41] PROBLEM - dhclient process on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:51] PROBLEM - DPKG on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:42:51] PROBLEM - puppet last run on meitnerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:31] RECOVERY - salt-minion processes on meitnerium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:44:31] RECOVERY - dhclient process on meitnerium is OK: PROCS OK: 0 processes with command name dhclient [20:44:41] RECOVERY - DPKG on meitnerium is OK: All packages OK [20:44:41] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures [20:47:16] mobrovac: ready to signal readiness to comms? [20:50:51] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[20:51:59] !log thcipriani@tin Finished scap: testwiki to php-1.29.0-wmf.12 and rebuild l10n cache (duration: 48m 04s) [20:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:18] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#3027045 (10Dzahn) Tested on spare server "gadolinium". status before changes, disabled: ``` /admin1-> racadm get BIOS.ProcSettings.LogicalProc [Key=BIOS.Setup.1-1#ProcSettings] LogicalProc=Disabled... [20:53:04] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#3027062 (10Ottomata) a:03Ottomata [21:00:16] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#3027085 (10Dzahn) and then after 10 minutes or so ... "Last Status Message: Task Failed .. Task Status: Failed " .. yea.. well.. good that we tried :p [21:00:48] !log thcipriani@tin Started scap: testwiki to php-1.29.0-wmf.12 and rebuild l10n cache (wikiversions.json not updated previously) [21:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:29] (03PS1) 10Ottomata: Add cdh-trusty and cdh-jessie reprepro updates and mirror them to a new cdh component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [21:07:58] (03PS8) 10Rush: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [21:10:13] !log thcipriani@tin Finished scap: testwiki to php-1.29.0-wmf.12 and rebuild l10n cache (wikiversions.json not updated previously) (duration: 09m 25s) [21:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:54] (03PS22) 10Madhuvishy: labstore: Diamond collector to track directory sizes 
[puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [21:11:09] (03CR) 10Madhuvishy: [V: 032 C: 032] labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [21:12:36] (03CR) 10Faidon Liambotis: [C: 04-1] "Minor comments inside. More importantly, you need to modify the jessie-wikimedia and trusty-wikimedia stanzas of the distributions file to" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) (owner: 10Ottomata) [21:12:51] (03CR) 10Hashar: jenkins: allow access log to be flipped (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337385 (owner: 10Hashar) [21:15:02] (03CR) 10Ottomata: Add cdh-trusty and cdh-jessie reprepro updates and mirror them to a new cdh component (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) (owner: 10Ottomata) [21:16:51] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [21:22:45] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#3027187 (10Dzahn) It gets better... now we have these pending and failing job that pops up at reboot so i need to clean up again and delete them. Why multiple jobs? attempts from the past? ``` /admi... [21:22:45] AaronSchulz: for some reason after scap I'm now seeing a lot of Fatal error: Call to undefined method __PHP_Incomplete_Class::hasReached() in /srv/mediawiki/php-1.29.0-wmf.11/includes/libs/rdbms/loadbalancer/LoadBalancer.php on line 491 which is very much like Friday's https://phabricator.wikimedia.org/T157831 [21:24:03] are you sure the code is there for the patch? 
[21:24:15] (03PS1) 10Milimetric: [WIP] DO NOT MERGE [puppet] - 10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) [21:24:27] AaronSchulz: it's not there, it's only there for wmf.12, this is for wmf.11 for some reason. [21:24:36] I cherry picked here: https://gerrit.wikimedia.org/r/#/c/337669/1 [21:24:51] (03PS3) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [21:25:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite2001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [21:25:59] AaronSchulz: I don't know why a sync would make the error appear on wmf.11 unless something is being pushed out of the cache (which is possible). Anyway, if the cherry pick looks right could you +2 so I can get the errors back to normal? [21:27:30] thanks [21:27:38] * AaronSchulz thought that was merged already [21:28:11] anyway, CR added [21:28:14] yeah, it merged in master and is on the latest branch that I'm pushing out today, just not backported. [21:28:24] (03PS4) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [21:28:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite2001 is OK: OK: Less than 70.00% above the threshold [25.0] [21:29:30] oh, that makes sense then [21:30:33] (03PS5) 10Nuria: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) [21:30:52] it's weird that it hasn't been a problem in wmf.11 until I pushed out the latest changes. 
[21:31:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [21:33:21] (03CR) 10Rush: [C: 031] "small notes I don't think are critical but worth asking" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [21:33:30] (03PS9) 10Rush: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [21:39:02] (03CR) 10jerkins-bot: [V: 04-1] Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [21:40:30] (03PS2) 10Ottomata: Symlink reportupdater output into published-datasets [puppet] - 10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric) [21:42:17] (03PS1) 10Madhuvishy: labstore: Fix sudo priveleges for user diamond [puppet] - 10https://gerrit.wikimedia.org/r/337713 [21:43:47] thcipriani: could be het-deploy related, with different namespaced values going in the key [21:44:36] ideally, the key name would have v1/v2 or something in a case like that (otherwise they go back and forth). [21:44:48] (03CR) 10Milimetric: [C: 031] "this looks good to me. When we merge it we have to deploy all the dashboards, otherwise they'll start seeing stale data. 
And we have to " [puppet] - 10https://gerrit.wikimedia.org/r/337672 (https://phabricator.wikimedia.org/T125854) (owner: 10Milimetric) [21:47:52] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#3027311 (10Smalyshev) [21:47:55] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#3027310 (10Smalyshev) 05Open>03Resolved [21:47:56] !log thcipriani@tin Synchronized php-1.29.0-wmf.11/includes/libs/rdbms/loadbalancer/LoadBalancer.php: [[gerrit:337669|Type check the APC value in LoadBalancer::doWait()]] (duration: 00m 50s) [21:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:57] (03PS2) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [21:50:54] (03PS5) 10Dzahn: lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - 10https://gerrit.wikimedia.org/r/337201 [21:53:34] (03CR) 10Madhuvishy: [C: 032] labstore: Fix sudo priveleges for user diamond [puppet] - 10https://gerrit.wikimedia.org/r/337713 (owner: 10Madhuvishy) [21:53:41] RECOVERY - MediaWiki exceptions and fatals per minute on graphite2001 is OK: OK: Less than 70.00% above the threshold [25.0] [21:56:14] (03PS1) 10Thcipriani: Group0 to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337744 [21:58:12] (03PS6) 10EBernhardson: [WIP] Drop mediawiki logs in HDFS after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/335140 [21:58:26] thcipriani, Krinkle: not urgent, but backporting https://gerrit.wikimedia.org/r/#/c/337730/ to 1.29.0-wmf.12 avoids the key competition [21:58:48] (03CR) 10Thcipriani: [C: 032] Group0 to 1.29.0-wmf.12 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/337744 (owner: 10Thcipriani) [21:59:07] though each already doing their instanceof should be enough to avoid errors by now [22:00:21] (03Merged) 10jenkins-bot: Group0 to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337744 (owner: 10Thcipriani) [22:00:34] (03CR) 10jenkins-bot: Group0 to 1.29.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337744 (owner: 10Thcipriani) [22:00:52] (03PS3) 10Ottomata: Add cloudera-trusty and cloudera-jessie reprepro updates and mirror them to a new cloudera component [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) [22:01:12] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#3027384 (10Ottomata) TODO after CDH upgrade, remove old cloudera/thirdparty updates. [22:01:42] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.29.0-wmf.12 [22:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:48] (03CR) 10Ottomata: "I didn't create the thirdparty/cloudera component, as it looked a little awkward to me to have a 'cloudera' directory at https://apt.wikime" [puppet] - 10https://gerrit.wikimedia.org/r/337657 (https://phabricator.wikimedia.org/T155726) (owner: 10Ottomata) [22:03:23] !log otto@tin Started deploy [analytics/refinery@4cd6305]: Deploying refinery with another update to drop hourly partitions script [22:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:04] !log otto@tin Finished deploy [analytics/refinery@4cd6305]: Deploying refinery with another update to drop hourly partitions script (duration: 01m 41s) [22:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:10] !log otto@tin Started deploy [analytics/refinery@4cd6305]: Deploying refinery with 
another update to drop hourly partitions script [22:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:23] !log otto@tin Finished deploy [analytics/refinery@4cd6305]: Deploying refinery with another update to drop hourly partitions script (duration: 01m 13s) [22:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:36] AaronSchulz: this error looks similar but it's currently happening less than the previous: https://phabricator.wikimedia.org/T158127 [22:09:07] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#3027444 (10Dzahn) "racadm jobqueue delete -i JID_CLEARALL_FORCE" was supposedly for deleting all jobs, but also doesn't work in this version. RAC992: Invalid job: JID_CLEARALL_FORCE. "racadm racrese... [22:10:19] (03PS1) 10Thcipriani: Revert "Group0 to 1.29.0-wmf.12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337753 [22:10:26] (03PS6) 10Dzahn: lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - 10https://gerrit.wikimedia.org/r/337201 [22:11:17] (03CR) 10Thcipriani: [C: 032] Revert "Group0 to 1.29.0-wmf.12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337753 (owner: 10Thcipriani) [22:11:48] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.29.0-wmf.11 for T158127 [22:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:53] T158127: Catchable fatal error: Object of class __PHP_Incomplete_Class could not be converted to string in /srv/mediawiki/php-1.29.0-wmf.12/includes/libs/rdbms/ChronologyProtector.php on line 124 - https://phabricator.wikimedia.org/T158127 [22:12:04] (03PS1) 10Madhuvishy: labstore: Remove misplaced init in DirectorySizeCollector [puppet] - 10https://gerrit.wikimedia.org/r/337754 [22:13:21] thcipriani: yeah, same kind of thing (old positions in cache for some users that edited recently) 
[22:13:30] (03Merged) 10jenkins-bot: Revert "Group0 to 1.29.0-wmf.12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337753 (owner: 10Thcipriani) [22:13:43] (03CR) 10jenkins-bot: Revert "Group0 to 1.29.0-wmf.12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337753 (owner: 10Thcipriani) [22:15:28] (03CR) 10Madhuvishy: [V: 032 C: 032] labstore: Remove misplaced init in DirectorySizeCollector [puppet] - 10https://gerrit.wikimedia.org/r/337754 (owner: 10Madhuvishy) [22:15:43] !log start staged nova-fullstack testing daemon on labnet1002 for metric inspection [22:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:28] (03PS7) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [22:16:56] (03CR) 10Thcipriani: [C: 032] Cleanup popups beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337636 (owner: 10Jdlrobson) [22:18:01] ugh, I'll improve that !empty() check [22:18:11] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#3027512 (10RobH) So the test host may have to have an onsite manually reset all its bios/drac settings and then set back up bios/drac. I'd think that would clear all pending jobs on the ilom interfac... 
[22:18:27] (03Merged) 10jenkins-bot: Cleanup popups beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337636 (owner: 10Jdlrobson) [22:18:36] (03CR) 10jenkins-bot: Cleanup popups beta cluster config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337636 (owner: 10Jdlrobson) [22:19:19] (03PS8) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [22:21:58] (03CR) 10Dzahn: [C: 032] "ok, i'll try it _one_ more time, but that's it heh :)" [puppet] - 10https://gerrit.wikimedia.org/r/337201 (owner: 10Dzahn) [22:22:14] (03PS7) 10Dzahn: lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - 10https://gerrit.wikimedia.org/r/337201 [22:23:10] Matiia: o/ [22:23:17] hi [22:23:40] thcipriani, Krinkle: https://gerrit.wikimedia.org/r/#/c/337755/1 [22:23:55] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:337636|Cleanup popups beta cluster config]] (beta-only-change) (duration: 00m 41s) [22:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:41] o/ hey folks. Can anyone tell me if akosiaris is around these days? [22:25:27] halfak: no, he is not around these days [22:25:34] he is on vacation [22:25:54] Gotcha. I figured he still was. Just couldn't figure out when he's planning to be back [22:26:23] akosiaris would usually help me with things like https://phabricator.wikimedia.org/T157222 [22:27:03] I want to get an estimate together before the annual plan exercise next week. [22:27:28] halfak: eh.. expect first week of March [22:27:58] I wonder if I should just start a thread on the ops mailing list or there's someone explicitly filling in for Alex on his ORES-related obligations [22:29:12] halfak: that would be good, yes [22:29:54] OK will do. 
Thanks :) [22:30:13] yw [22:35:21] (03CR) 10BryanDavis: [C: 032] python: Install icu dev files [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337634 (https://phabricator.wikimedia.org/T157744) (owner: 10Yuvipanda) [22:35:49] (03CR) 10Dzahn: [C: 032] "was already compiled on '*'" [puppet] - 10https://gerrit.wikimedia.org/r/337201 (owner: 10Dzahn) [22:37:05] (03Merged) 10jenkins-bot: python: Install icu dev files [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337634 (https://phabricator.wikimedia.org/T157744) (owner: 10Yuvipanda) [22:41:59] 06Operations, 10ops-eqiad: hard-reset DRAC gadolinium.mgmt.eqiad.wmnet - https://phabricator.wikimedia.org/T158131#3027571 (10Dzahn) [22:42:44] 06Operations, 10ops-eqiad: hard-reset DRAC gadolinium.mgmt.eqiad.wmnet - https://phabricator.wikimedia.org/T158131#3027586 (10Dzahn) p:05Triage>03Normal normal prio, or even low. it's a spare server. i linked this ticket in the google sheet. [22:44:27] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#3027591 (10Dzahn) @Robh Ok, yep. Thanks. Created T158131 and linked that in the "spares" sheet. [22:48:39] (03CR) 10Dzahn: [C: 032] "this has also been compiled on everything http://puppet-compiler.wmflabs.org/5441/" [puppet] - 10https://gerrit.wikimedia.org/r/337202 (owner: 10Dzahn) [22:50:13] (03CR) 10Dzahn: "of course there is a rebase conflict now. incredibly hard to merge since we are moving fast and compiler runs for hours... 
sigh" [puppet] - 10https://gerrit.wikimedia.org/r/337202 (owner: 10Dzahn) [22:50:56] (03PS9) 10Rush: openstack: nova_fullstack_test changes to daemonize [puppet] - 10https://gerrit.wikimedia.org/r/337598 [22:52:53] (03Abandoned) 10Dzahn: lint: 'include standard' -> 'include ::standard' [puppet] - 10https://gerrit.wikimedia.org/r/337202 (owner: 10Dzahn) [22:57:37] (03PS2) 10Dzahn: aptrepo: rsync the entire /srv/ automatically, not just /srv/wikimedia/ [puppet] - 10https://gerrit.wikimedia.org/r/337498 [22:58:16] (03PS3) 10Dzahn: aptrepo: rsync the entire /srv/ automatically, not just /srv/wikimedia/ [puppet] - 10https://gerrit.wikimedia.org/r/337498 [23:00:24] (03CR) 10Andrew Bogott: openstack: nova_fullstack_test changes to daemonize (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/337598 (owner: 10Rush) [23:02:14] (03PS6) 10Andrew Bogott: quarry: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:06:12] (03CR) 10Andrew Bogott: [C: 032] quarry: Linting changes [puppet] - 10https://gerrit.wikimedia.org/r/334309 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [23:06:41] (03CR) 10Dzahn: [C: 032] aptrepo: rsync the entire /srv/ automatically, not just /srv/wikimedia/ [puppet] - 10https://gerrit.wikimedia.org/r/337498 (owner: 10Dzahn) [23:06:52] (03PS4) 10Dzahn: aptrepo: rsync the entire /srv/ automatically, not just /srv/wikimedia/ [puppet] - 10https://gerrit.wikimedia.org/r/337498 [23:17:41] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:20:04] (03Restored) 10Dzahn: lint: 'include standard' -> 'include ::standard' [puppet] - 10https://gerrit.wikimedia.org/r/337202 (owner: 10Dzahn) [23:21:35] oh hey before i forget... 
[23:21:57] i've got a small puppet change for tweaking the videoscaler job queue runner counts [23:22:23] i had queued it in today's puppet swat but that didn't happen due to a flurry of errors at the time. should i poke someone again about it or will it get gotten back to later? [23:35:34] (03PS2) 10Dzahn: lint: 'include standard' -> 'include ::standard' [puppet] - 10https://gerrit.wikimedia.org/r/337202 [23:46:41] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [23:50:22] (03CR) 10Dzahn: [C: 032] "hashar: thanks for the extra check, rebased and was compiled on all" [puppet] - 10https://gerrit.wikimedia.org/r/337202 (owner: 10Dzahn)