[00:02:12] (PS1) Ayounsi: Reserve IPs for eqdfw GTT transport vlans [dns] - https://gerrit.wikimedia.org/r/443010
[00:02:57] (CR) Ayounsi: [C: 2] Reserve IPs for eqdfw GTT transport vlans [dns] - https://gerrit.wikimedia.org/r/443010 (owner: Ayounsi)
[00:21:39] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational
[00:24:59] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:31:53] hmm
[00:32:08] andrewbogott chasemp ^^
[00:32:13] I think that's ok for now paladox. labcontrol1004 is not active
[00:32:33] that's one of the 2 new servers we are in the process of bringing online
[00:32:56] ah ok
[00:32:58] I'll see if I can silence it more than it is already
[01:09:24] (CR) Krinkle: [C: 2] Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/432016 (owner: Krinkle)
[01:09:42] RoanKattouw: I assume the deploy is finished?
[01:10:48] (Merged) jenkins-bot: Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/432016 (owner: Krinkle)
[01:11:03] (CR) jenkins-bot: Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/432016 (owner: Krinkle)
[01:11:23] Yes
[01:11:29] Oh crap I forgot to run that full scap for Niharika
[01:11:40] CI took so damn long that I forgot about it
[01:12:22] k, well, just give me a minute in that case.
[01:14:07] * Krinkle staging on deploy1001 and mwdebug1002
[01:17:04] (PS1) Krinkle: profiler: Put 'global' keyword back [mediawiki-config] - https://gerrit.wikimedia.org/r/443016
[01:17:17] (CR) Krinkle: [C: 2] profiler: Put 'global' keyword back [mediawiki-config] - https://gerrit.wikimedia.org/r/443016 (owner: Krinkle)
[01:18:32] (Merged) jenkins-bot: profiler: Put 'global' keyword back [mediawiki-config] - https://gerrit.wikimedia.org/r/443016 (owner: Krinkle)
[01:19:22] !log krinkle@deploy1001 Synchronized vendor/: I39e592d837 - remove redundant composer packages for xhgui profiling (duration: 00m 52s)
[01:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:43] RoanKattouw: OK.. You can do your scap now :)
[01:19:49] OK will do
[01:20:21] !log catrope@deploy1001 Started scap: Full scap to fix i18n issues on testwiki
[01:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:02] (CR) Krinkle: "Checked-picked to beta's puppetmaster since 2-3 weeks. Fixed beta's deployment host." [puppet] - https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) (owner: Krinkle)
[01:53:04] !log catrope@deploy1001 Finished scap: Full scap to fix i18n issues on testwiki (duration: 32m 42s)
[01:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:41] (CR) jenkins-bot: profiler: Put 'global' keyword back [mediawiki-config] - https://gerrit.wikimedia.org/r/443016 (owner: Krinkle)
[02:21:28] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational
[03:27:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.74 seconds
[03:31:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 252.37 seconds
[04:03:32] Thanks RoanKattouw!
[04:41:35] (CR) Smalyshev: Enable smater logging for wdqs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[04:48:55] (PS2) Marostegui: Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/442891
[04:50:15] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/442891 (owner: Marostegui)
[04:51:27] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/442891 (owner: Marostegui)
[04:51:44] (CR) jenkins-bot: Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/442891 (owner: Marostegui)
[04:51:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021
[04:52:03] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021
[04:52:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1015 (duration: 00m 52s)
[04:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:18] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021 (owner: Marostegui)
[04:54:29] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021 (owner: Marostegui)
[04:55:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 after alter table (duration: 00m 50s)
[04:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:56] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/443022 (https://phabricator.wikimedia.org/T191316)
[04:57:15] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021 (owner: Marostegui)
[04:59:25] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/443022 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:00:04] !log /script unload irssinotifier
[05:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:15] gah!
[05:00:17] :)
[05:00:33] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/443022 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:01:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 for alter table (duration: 00m 50s)
[05:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:02] !log Deploy schema change on db1104 T191316 T192926 T89737 T195193
[05:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:08] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:02:09] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:02:09] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:02:09] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:04:01] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/443022 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:04:09] Operations, ops-eqiad, Analytics, User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707 (Marostegui) Still on-going ``` root@dbstore1002:~# megacli -PDRbld -ShowProg -PhysDrv [32:5] -aALL Rebuild Progress on Device at Enclosure 32, Slot 5 Completed 44% in 1019 Min...
[05:10:22] Operations, monitoring: dbstore1001 backup jobs failed between 2016-10-19 and 2016-11-23 - https://phabricator.wikimedia.org/T151579 (Marostegui)
[05:11:23] Operations, DBA, monitoring, Patch-For-Review: Create script to monitor db dumps for backups are successful - https://phabricator.wikimedia.org/T151999 (Marostegui) Open>Resolved a:jcrespo I am going to close this as resolved as per T151999#4038297 and going to create a task for the "...
[05:40:14] !log force umount of dumps labstore nfs mountpoints on stat100[56]/notebook100[34] to reduce load (also too many open files)
[05:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:32] Operations, ops-eqiad, Cloud-VPS, Datasets-General-or-Unknown: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (elukey) As a follow up, I had to force umount the labstore1006's nfs mountpoint on stat100[5,6] and notebook100[3,4] since the...
[06:05:56] (PS2) Jcrespo: mariadb: Prepare for reimage of es1016 to stretch [puppet] - https://gerrit.wikimedia.org/r/442795
[06:07:12] (CR) Jcrespo: [C: 2] mariadb: Prepare for reimage of es1016 to stretch [puppet] - https://gerrit.wikimedia.org/r/442795 (owner: Jcrespo)
[06:21:36] !log stop and reimage es1016
[06:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:28] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:00:44] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:05:48] (PS1) Jcrespo: mariadb: Reenable notifications on es1016 after reimage [puppet] - https://gerrit.wikimedia.org/r/443028
[07:07:25] PROBLEM - Nginx local proxy to apache on mw2289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:08:15] RECOVERY - Nginx local proxy to apache on mw2289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.189 second response time
[07:09:05] (PS1) ArielGlenn: add labstore106 back to hosts that get rsync from stats servers [puppet] - https://gerrit.wikimedia.org/r/443029
[07:11:14] (PS2) ArielGlenn: add labstore1006 back to hosts that get rsync from stats servers [puppet] - https://gerrit.wikimedia.org/r/443029
[07:11:46] (PS2) Muehlenhoff: Move php5 packages to contint class [puppet] - https://gerrit.wikimedia.org/r/438164
[07:13:24] (CR) ArielGlenn: [C: 2] add labstore1006 back to hosts that get rsync from stats servers [puppet] - https://gerrit.wikimedia.org/r/443029 (owner: ArielGlenn)
[07:13:34] (CR) Jcrespo: [C: 2] mariadb: Reenable notifications on es1016 after reimage [puppet] - https://gerrit.wikimedia.org/r/443028 (owner: Jcrespo)
[07:13:43] (PS2) Jcrespo: mariadb: Reenable notifications on es1016 after reimage [puppet] - https://gerrit.wikimedia.org/r/443028
[07:17:42] (CR) Muehlenhoff: [C: 1] "Looks good!" [software/debmonitor] - https://gerrit.wikimedia.org/r/442876 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[07:41:12] (PS1) Marostegui: install_server: Allow reimage db2054 [puppet] - https://gerrit.wikimedia.org/r/443032
[07:41:38] (PS1) Jcrespo: Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033
[07:43:18] (PS2) Jcrespo: Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033
[07:43:22] (PS2) Marostegui: install_server: Allow reimage db2054 [puppet] - https://gerrit.wikimedia.org/r/443032
[07:45:06] (CR) Marostegui: [C: 2] install_server: Allow reimage db2054 [puppet] - https://gerrit.wikimedia.org/r/443032 (owner: Marostegui)
[07:46:42] (PS1) Marostegui: db-codfw.php: Depool db2054 [mediawiki-config] - https://gerrit.wikimedia.org/r/443035
[07:47:37] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033 (owner: Jcrespo)
[07:48:55] (Merged) jenkins-bot: Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033 (owner: Jcrespo)
[07:49:49] (CR) jenkins-bot: Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033 (owner: Jcrespo)
[07:50:39] (PS1) Muehlenhoff: Enable microcode updates for all elasticsearch hosts [puppet] - https://gerrit.wikimedia.org/r/443038 (https://phabricator.wikimedia.org/T127825)
[07:51:09] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2054 [mediawiki-config] - https://gerrit.wikimedia.org/r/443035 (owner: Marostegui)
[07:52:31] (Merged) jenkins-bot: db-codfw.php: Depool db2054 [mediawiki-config] - https://gerrit.wikimedia.org/r/443035 (owner: Marostegui)
[07:53:46] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2054 for reimage (duration: 00m 50s)
[07:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:50] !log Stop MySQL on db2054 for reimage
[07:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:11] (CR) jenkins-bot: db-codfw.php: Depool db2054 [mediawiki-config] - https://gerrit.wikimedia.org/r/443035 (owner: Marostegui)
[07:56:14] Operations, netops: Allow labnet/labnodepool/labvirt to connect to debmonitor hosts/443 - https://phabricator.wikimedia.org/T198375 (MoritzMuehlenhoff) Thanks, confirmed working fine. All missing hosts were able to ingest their package data and servermon and debdeploy are now tracking the same number of...
[07:57:23] marostegui: are you deploying?
[07:57:30] I just did
[07:57:49] not a problem, but you pulled also my change
[07:58:06] only mentioning it because on other case it could be a problem
[07:58:30] yeah, I saw your change, but I thought you already deployed it
[07:58:38] not yet
[07:58:52] I was waiting for the +2
[07:59:37] but it arrived at 09:48
[07:59:39] no?
[08:00:40] I am just saying check how many changes you pull, sometimes you find random things
[08:00:49] even non-db stuff
[08:01:02] yeah
[08:01:27] so what did you deploy, only codfw?
[08:01:35] yes
[08:01:46] then this is strange
[08:01:52] because I saw my change deployed
[08:01:58] which is only eqiad
[08:02:09] 07:53:46 Synchronized wmf-config/db-codfw.php: Depool db2054 for reimage (duration: 00m 50s)
[08:02:12] oh, it wasn't
[08:02:23] I was looking at the wrong place, I will do it now
[08:03:48] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1016 (duration: 00m 50s)
[08:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:48] Operations, Analytics, Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (ema) p:Triage>Normal
[08:23:43] (PS1) Vgutierrez: varnishkafka: Set TLS curve list and sigalgs list defaults [puppet] - https://gerrit.wikimedia.org/r/443043 (https://phabricator.wikimedia.org/T182993)
[08:24:17] (CR) Elukey: [C: 1] varnishkafka: Set TLS curve list and sigalgs list defaults [puppet] - https://gerrit.wikimedia.org/r/443043 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:27:47] !log upgrade php on phab1001 to 5.6.33+dfsg-0+deb8u1
[08:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:30] moritzm: fyi ^
[08:35:49] (CR) Vgutierrez: [C: 2] "pcc looks happy and shows noop in upload and misc and the expected changes in text: https://puppet-compiler.wmflabs.org/compiler02/11613/" [puppet] - https://gerrit.wikimedia.org/r/443043 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:39:23] !log Apply new TLS varnishkafka settings in cache::text nodes - T182993
[08:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:26] T182993: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993
[08:39:59] (PS1) Alexandros Kosiaris: phabricator: Use the mysql native driver [puppet] - https://gerrit.wikimedia.org/r/443045
[08:46:24] (CR) Gehel: [C: 1] "LGTM, no problem identified on the test hosts." [puppet] - https://gerrit.wikimedia.org/r/443038 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[08:49:45] Operations, Mail, Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841 (Aklapper) >>! In T146841#3729163, @Dzahn wrote: > @Seb35 @Peachey88 @Herron since T168467 is resolved meanwhile,...
[08:50:00] (PS1) Marostegui: mariadb: Enable notifications on db2054 [puppet] - https://gerrit.wikimedia.org/r/443046
[08:51:08] Operations, Analytics, Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (ema) Both [[ https://varnish-cache.org/docs/5.1/reference/varnishd.html#http-req-hdr-len | varnish ]] and [[http://nginx.org/en/docs/http/ngx_http_core_module.html#large_client_header_bu...
[08:52:59] (CR) Marostegui: [C: 2] mariadb: Enable notifications on db2054 [puppet] - https://gerrit.wikimedia.org/r/443046 (owner: Marostegui)
[08:57:40] (PS1) Marostegui: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - https://gerrit.wikimedia.org/r/443048
[09:02:02] (PS1) Muehlenhoff: Add component/vp9 [puppet] - https://gerrit.wikimedia.org/r/443052 (https://phabricator.wikimedia.org/T190333)
[09:05:16] (PS2) Muehlenhoff: Add component/vp9 [puppet] - https://gerrit.wikimedia.org/r/443052 (https://phabricator.wikimedia.org/T190333)
[09:05:18] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2054" [mediawiki-config] - https://gerrit.wikimedia.org/r/443048 (owner: Marostegui)
[09:06:29] (CR) Muehlenhoff: [C: 2] Add component/vp9 [puppet] - https://gerrit.wikimedia.org/r/443052 (https://phabricator.wikimedia.org/T190333) (owner: Muehlenhoff)
[09:06:40] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - https://gerrit.wikimedia.org/r/443048 (owner: Marostegui)
[09:08:33] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2054 after reimage (duration: 00m 51s)
[09:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:35] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - https://gerrit.wikimedia.org/r/443048 (owner: Marostegui)
[09:16:56] (PS1) Marostegui: install_server: Allow reimage db2058 [puppet] - https://gerrit.wikimedia.org/r/443054
[09:18:34] !log uploaded libvpx 1.7.0-3+wmf1 to apt.wikimedia.org/stretch-wikimedia (component/vp9) (T190333)
[09:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:38] T190333: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333
[09:18:41] (PS1) Marostegui: db-codfw.php: Depool db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/443055
[09:18:51] (PS2) Marostegui: install_server: Allow reimage db2058 [puppet] - https://gerrit.wikimedia.org/r/443054
[09:19:41] (CR) Marostegui: [C: 2] install_server: Allow reimage db2058 [puppet] - https://gerrit.wikimedia.org/r/443054 (owner: Marostegui)
[09:20:38] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/443055 (owner: Marostegui)
[09:22:07] (Merged) jenkins-bot: db-codfw.php: Depool db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/443055 (owner: Marostegui)
[09:23:14] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2058 for reimage (duration: 00m 50s)
[09:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:17] !log Stop MySQL on db2058 to reimage it
[09:23:18] (CR) Gehel: Enable smater logging for wdqs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:50] (CR) Alexandros Kosiaris: [C: 1] Improve validation on host package updates [software/debmonitor] - https://gerrit.wikimedia.org/r/442876 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[09:25:45] (CR) jenkins-bot: db-codfw.php: Depool db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/443055 (owner: Marostegui)
[09:26:30] Operations, ops-eqiad, Cloud-VPS, Datasets-General-or-Unknown: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (ArielGlenn) Fsck on labstore1006 completed last night, it did not take long. labstore1006 after 'repair' of logical dri...
[09:26:47] (CR) Alexandros Kosiaris: [C: -1] manage.py: add custom command for GC (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/442901 (owner: Volans)
[09:35:25] (PS2) Volans: Add custom Django management command [software/debmonitor] - https://gerrit.wikimedia.org/r/442901
[09:35:41] (PS3) Gehel: Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:36:09] (CR) jerkins-bot: [V: -1] Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:36:15] (CR) Gehel: Enable smater logging for wdqs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:36:25] (CR) jerkins-bot: [V: -1] Add custom Django management command [software/debmonitor] - https://gerrit.wikimedia.org/r/442901 (owner: Volans)
[09:36:37] Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (mobrovac)
[09:39:26] (CR) Volans: "Addressed comment, thanks for the review!" (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/442901 (owner: Volans)
[09:41:09] PROBLEM - Disk space on labstore1006 is CRITICAL: DISK CRITICAL - free space: / 33468 MB (3% inode=99%)
[09:42:44] (PS4) Gehel: Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:43:25] (CR) jerkins-bot: [V: -1] Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:44:46] (CR) Alexandros Kosiaris: [C: 1] Add custom Django management command [software/debmonitor] - https://gerrit.wikimedia.org/r/442901 (owner: Volans)
[09:45:34] Operations, Discovery, Discovery-Search: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 (Gehel) About RAID10 / RAID5(0), in both cases it would require adding more disk to those servers, which is something we are trying to avoid. The JBOD approach could work...
[09:46:41] (CR) Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/11616/" [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:48:14] (PS5) Gehel: Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:56:38] (PS1) Marostegui: Revert "db-codfw.php: Depool db2058" [mediawiki-config] - https://gerrit.wikimedia.org/r/443061
[09:56:48] (PS1) Marostegui: Revert "install_server: Allow reimage db2058" [puppet] - https://gerrit.wikimedia.org/r/443062
[10:00:27] (CR) Marostegui: [C: 2] Revert "install_server: Allow reimage db2058" [puppet] - https://gerrit.wikimedia.org/r/443062 (owner: Marostegui)
[10:10:47] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065
[10:10:53] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065
[10:12:08] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065 (owner: Marostegui)
[10:13:27] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065 (owner: Marostegui)
[10:13:44] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065 (owner: Marostegui)
[10:14:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1104 after alter table (duration: 00m 50s)
[10:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:00] (PS2) Muehlenhoff: Enable microcode updates for all elasticsearch hosts [puppet] - https://gerrit.wikimedia.org/r/443038 (https://phabricator.wikimedia.org/T127825)
[10:15:17] (CR) Gehel: [C: 1] "LGTM - I'd love to have Giuseppe input on this, but this is already enough of an improvement that we could go forward as-is." [puppet] - https://gerrit.wikimedia.org/r/440498 (owner: EBernhardson)
[10:15:21] (PS1) Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/443067 (https://phabricator.wikimedia.org/T191316)
[10:15:50] (CR) Muehlenhoff: [C: 2] Enable microcode updates for all elasticsearch hosts [puppet] - https://gerrit.wikimedia.org/r/443038 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[10:17:04] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/443067 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:18:06] !log mobrovac@deploy1001 Started deploy [citoid/deploy@40cdff7]: Update citoid to fd77117 - T165105 T197853
[10:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:10] T165105: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105
[10:18:11] T197853: Trailing single quote in DOIs breaks citoid - https://phabricator.wikimedia.org/T197853
[10:18:25] (Merged) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/443067 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:19:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 for alter table (duration: 00m 50s)
[10:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:01] !log Deploy schema change on db1092 T191316 T192926 T89737 T195193
[10:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:07] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[10:20:08] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[10:20:08] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[10:20:08] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[10:21:07] (CR) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/443067 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:21:32] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@40cdff7]: Update citoid to fd77117 - T165105 T197853 (duration: 03m 26s)
[10:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:36] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2058" [mediawiki-config] - https://gerrit.wikimedia.org/r/443061 (owner: Marostegui)
[10:33:52] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2058" [mediawiki-config] - https://gerrit.wikimedia.org/r/443061 (owner: Marostegui)
[10:34:56] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2058 after reimage (duration: 00m 51s)
[10:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:19] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2058" [mediawiki-config] - https://gerrit.wikimedia.org/r/443061 (owner: Marostegui)
[11:01:56] !log installing lame security updates
[11:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:21] Operations, Operations-Software-Development, Phabricator, Technical-Debt: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045 (Aklapper)
[11:05:09] (PS1) Muehlenhoff: Add library hint for lame [puppet] - https://gerrit.wikimedia.org/r/443076
[11:05:58] (CR) Muehlenhoff: [C: 2] Add library hint for lame [puppet] - https://gerrit.wikimedia.org/r/443076 (owner: Muehlenhoff)
[11:11:11] !log installing zsh updates from jessie 8.11 point release
[11:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:25] (CR) Ladsgroup: "This can be deployed now. Probably on Monday I guess but it's in prod now." [puppet] - https://gerrit.wikimedia.org/r/440986 (https://phabricator.wikimedia.org/T147169) (owner: Ladsgroup)
[11:21:53] (PS1) Muehlenhoff: Fix Cumin alias for kafka/analytics [puppet] - https://gerrit.wikimedia.org/r/443077
[11:44:03] (PS1) Muehlenhoff: Add library hint for xerces-c [puppet] - https://gerrit.wikimedia.org/r/443078
[11:44:35] (CR) jerkins-bot: [V: -1] Add library hint for xerces-c [puppet] - https://gerrit.wikimedia.org/r/443078 (owner: Muehlenhoff)
[11:49:57] PROBLEM - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:03:27] RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational
[12:30:55] !log uploaded ffmpeg 3.2.10-1~deb9u1+wmf1 to apt.wikimedia.org/stretch-wikimedia (component/vp9) (linked against libvpx 1.7 and with backported row-mt support) (T190333)
[12:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:58] T190333: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333
[12:50:06] PROBLEM - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:54:56] (CR) Paladox: "We are planning to reinstall with stretch (by switching to phab1002 and then back to phab1001)" [puppet] - https://gerrit.wikimedia.org/r/443045 (owner: Alexandros Kosiaris)
[12:55:05] (CR) Paladox: [C: 1] phabricator: Use the mysql native driver [puppet] - https://gerrit.wikimedia.org/r/443045 (owner: Alexandros Kosiaris)
[12:59:51] (PS2) Elukey: Fix Cumin alias for kafka/analytics [puppet] - https://gerrit.wikimedia.org/r/443077 (owner: Muehlenhoff)
[13:00:42] (CR) Elukey: [C: 2] Fix Cumin alias for kafka/analytics [puppet] - https://gerrit.wikimedia.org/r/443077 (owner: Muehlenhoff)
[13:01:36] Operations, Analytics, Analytics-Kanban, User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492 (elukey)
[13:03:09] (PS1) Elukey: icing::monitor::analytics: move per host vk alarms to aggregates [puppet] - https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492)
[13:03:13] ^ apergos do we know why this is flapping?
[13:03:33] no
[13:03:52] ● systemd-timedated.service loaded failed failed Time & Date Service
[13:04:05] sorry, I was / am adding to the incident report
[13:04:07] from yesterday
[13:12:15] /dev/dm-0 916G 916G 0 100% /
[13:12:17] chasemp:
[13:12:46] interesting
[13:15:56] RECOVERY - Disk space on labstore1006 is OK: DISK OK
[13:15:58] something wrote to /srv/dumps while the filesystem was unmounted
[13:16:35] I unmounted, cleared out the crap, remounted just now
[13:17:57] RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational
[13:19:37] and that's done
[13:21:24] (CR) Ottomata: "+1 to the general idea, but I don't think this should be in the 'analytics' profile. With the exception of maybe eventlogging, none of th" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492) (owner: Elukey)
[13:22:27] (CR) Ottomata: "Q Tho: Won't this mean that if a single vk host starts dropping messages, we won't get alerts?" [puppet] - https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492) (owner: Elukey)
[13:31:10] (CR) Elukey: "> +1 to the general idea, but I don't think this should be in the" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492) (owner: Elukey)
[13:39:37] (PS2) Elukey: profile::cache::kafka::alerts: move per host vk alarms to aggregates [puppet] - https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492)
[13:52:10] Operations, ops-eqiad, DC-Ops, cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (Andrew)
[13:52:18] Operations, ops-eqiad, DC-Ops, cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (Andrew) p:Triage>High
[14:12:04] Critical Alert for device asw-a-eqiad.mgmt.eqiad.wmnet - Critical syslog messages
[14:13:04] Critical Alert for device asw-b-eqiad.mgmt.eqiad.wmnet - Critical syslog messages
[14:13:13] XioNoX: ^
[14:13:25] thx, looking
[14:13:47] " #1: msg => fatal: no kex alg; "
[14:13:56] Operations, Security-Team, Traffic, Wikimedia-General-or-Unknown: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618 (Aklapper)
[14:15:14] it is not mgmt only
[14:15:36] chasemp: did you try to ssh to asw-a/b using weird ssh settings?
[14:15:54] oh ignore me, unrelated bgp issue
[14:16:13] jynus: bgp issue?
[14:16:20] XioNoX: yes, looking to cleanup the cloud-hosts-d vlan to coopt that 1120 vlan-id as discussed [14:16:33] not bgp issue [14:16:49] that is how it is named on icinga [14:16:56] ignore me [14:17:03] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-a-eqiad.mgmt.eqiad.wmnet recovered from Critical syslog messages [14:17:14] chasemp: ok, wanted to understand that "critical" syslog message [14:17:20] asw-a can't handle my diffie-hellman-group-exchange-sha256 bad self [14:17:55] XioNoX: ack, I had pinned settings for asw2* but same negotiation won't work on old devices, sorry for the noise man [14:17:58] no idea why juniper classifies it as "critical" anyway [14:18:04] that is...weird [14:18:11] hehe [14:19:11] probably views it as probing [14:22:03] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-b-eqiad.mgmt.eqiad.wmnet recovered from Critical syslog messages [14:25:15] 10Operations, 10TimedMediaHandler-Transcode, 10Patch-For-Review: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333 (10MoritzMuehlenhoff) I've backported the row-mt patch to ffmpeg 3.2. This allows us to stick with the ffmpeg security updates for Debian... 
[14:27:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443092 [14:29:01] 10Operations, 10Kubernetes, 10Security-Other: Network segmentation for WMF servers - https://phabricator.wikimedia.org/T101912 (10Aklapper) [14:40:40] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443092 (owner: 10Marostegui) [14:41:59] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443092 (owner: 10Marostegui) [14:42:16] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443092 (owner: 10Marostegui) [14:43:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1092 after alter table (duration: 00m 50s) [14:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:18] (03CR) 10Ottomata: [C: 031] "profile::cache::kafka::alerts, I like :)" [puppet] - 10https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492) (owner: 10Elukey) [14:46:03] (03Abandoned) 10Gehel: elasticsearch: raise alerting limit for free disk space [puppet] - 10https://gerrit.wikimedia.org/r/430066 (https://phabricator.wikimedia.org/T192972) (owner: 10Gehel) [14:54:07] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,image_status,podsandbox_status,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:55:17] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:20:34] (03PS1) 10Papaul: DHCP: Add MAC address entries for graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/443100 (https://phabricator.wikimedia.org/T196483) [15:20:39] 10Operations, 10Pywikibot-General, 10Pywikibot-core, 10Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769 (10Aklapper) [15:23:05] 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [15:23:34] XioNoX: --^ ? [15:23:51] thx [15:24:49] come on, fridays are supposed to be quiet [15:25:09] Hello, traffic spike [15:25:10] https://librenms.wikimedia.org/graphs/to=1530285600/id=11600/type=port_bits/from=1530199200/ [15:26:26] well, our ashburn IXP port is saturating [15:26:42] for traffic originate from us, or towards us [15:26:43] ? [15:26:57] outbound traffic [15:27:09] tiny amount of inbound traffic, huge outbound [15:28:01] bblack, vgutierrez, ema, might need some help figuring out what's going on [15:28:28] I can see the change in traffic, but not in number of requests [15:28:33] in ~5/10min the equinix IXP dashboard should tell us toward whom that traffic is [15:31:07] there is no drop from any other providers, so it's not just traffic re-routing there [15:31:08] I think it is upload [15:31:30] I also see commons slightly high load [15:31:39] usually upload is somehow related, whenever it's more about bytes than reqs [15:32:02] I know there were also some changes related to Commons yesterday that were rolled out and then reverted, and I didn't follow closely [15:32:09] there were related complaints about commons images, etc [15:32:13] swift: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1 [15:32:14] maybe somewhere inter-related with this [15:32:34] cache upload: 
https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_upload&var-instance=All&from=1530275542254&to=1530286342254 [15:33:28] https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&from=now-3h&to=now&var-cluster=cache_upload&var-site=All&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5 [15:33:42] ^ upload hitrate dropped by a full percent as well (which is a ton more misses to the backends) [15:34:05] not sure if related but I see a big spike of traffic for, for example, ms-be https://librenms.wikimedia.org/graphs/to=1530286200/id=2303/type=port_bits/from=1530199800/ [15:34:31] the slope of the hitrate dropoff was over the period ~15:09->15:14 [15:34:33] ms-be is swift, yes, related [15:34:34] that would almost certainly be related :) [15:35:12] so, at the very least there's a big pattern shift that's causing more outbound traffic and more cache misses, which will indirectly cause more network on cache_upload+ms_be, and more load on ms_be in general [15:35:52] I can't disable that port, otherwise it will most likely saturate a transit link instead [15:35:56] as best I can tell it's only the eqiad cache frontends seeing this [15:36:05] let's convert this into a cookbook incident scenario next week :) [15:36:10] so it's eqiad-specific, which likely means it's specific to a certain client network [15:37:00] still waiting on the equinix IXP dashboard to update and show where traffic is going [15:37:36] (limiting the hitrate impact graph to eqiad, the dropoff is more like 97.5% -> 93.6%, which means the backend reqrate to swift there has multiplied by ~2.5x. [15:37:39] ) [15:37:56] (for eqiad's reqs, which are a fraction of overall swift reqs, but still) [15:38:08] it is back to normal now [15:38:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Bstorm) Should be.
It's an HP Smart P420i in a RAID 10 logical disk and is the only failure. Unless the disk itself isn't a hot swap form factor, it should be good, right?... [15:38:49] 1.1GB for all upload caches [15:40:09] I'm digging through oxygen sampled-1000 for IP/network stuff, should have something shortly [15:41:01] seems at least somewhat distributed [15:41:40] I will check performance impact [15:41:41] in terms of destinations or URIs? [15:41:50] client IPs I mean [15:42:02] although now that I'm filtering down to /24 before aggregating, it's looking like Google [15:42:53] googlebot is high in the list of client networks for 15:00 -> 15:39 anyways, but that may be normal heh :) [15:43:05] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [15:44:03] I don't remember seeing this alert before from librenms-wmf...could it be this is somewhat common over time (pun) but we were not noticing it? [15:44:37] it's relatively-new, but been ... a month or two? [15:45:13] anyways, there's no obvious source network that's doing a huge percentage of requests on its own, all the traffic looks fairly-reasonably distributed, which is odd. [15:45:29] I see a big spike ~2Gbps toward VADATA-DC2-IX-01 ASN: 16509, google doesn't seem to show a spike (yet at least) [15:45:36] ack, a month or two is long enough for it to be anomalous for sure [15:45:39] but it may be this is a problem that's hiding in request-counts, because they're very few but very very large outbound requests [15:46:18] maybe someone ripping down a bunch of huge tiff/djvu from one or more 1Gbps+ hosts that happen to be located right by us in VA? [15:46:21] wasn't there a varnish tool that showed the top objects in terms of object size? [15:46:24] (too late now I guess) [15:46:37] not that correlates well for this problem [15:47:15] 16509 == EC2? [15:47:42] (03CR) 10PleaseStand: "Is this change actually "part II" instead of "part I"? 
I would think the use of the variable (in CommonSettings.php) should be removed fir" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441518 (owner: 10Jforrester) [15:48:00] so yeah, maybe an individual user doing something heavy from an EC2 instance at low latency from eqiad, not realizing the impact [15:48:02] indeed, AWS [15:48:11] (or from more than one) [15:48:46] we have req ratelimits we impose on everyone by default to ensure some sanity-cap, but not byte ratelimits [15:50:12] is a byte ratelimit possible? [15:50:29] aka, what to do if this happens again? [15:51:24] I don't know, it's not something we have an easy switch for anyways, would require some engineering and thinking [15:53:32] peering with ec2 ;) [15:53:36] private peering I mean [15:54:45] sounds like a heavy option :) [15:55:09] also not a quick one :) [15:56:40] being able to easily parse out top requesters by cache misses would be interesting [15:57:46] yeah although for this particular case it wouldn't matter much if they were cache hits or not [15:58:05] there's varnishtop [15:58:16] but that's based on #requests, not #bytes I think [15:58:27] (maybe it can do that too nowadays, i haven't used it in a long while) [15:58:42] and analytics may have interesting tools for it nowadays too [16:00:53] Equinix graph settled, all the spike is toward ASW [16:00:55] ASW [16:00:59] er, AWS [16:06:53] (03PS1) 10Reedy: Remove duplicate phpunit entry from composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443108 [16:07:09] 3 GiB/s for 25 min, so 4500 GiB? [16:07:28] Could be just the download of a relatively small collection of videos from Commons [16:07:33] (03CR) 10BryanDavis: [C: 031] Remove duplicate phpunit entry from composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443108 (owner: 10Reedy) [16:08:07] usually it's hard, at least with a single host/thread, for one user to suck so much bandwidth from us though. 
it's easier if they're very low-latency to us, as EC2 us-east-1 likely is :) [16:10:21] Nemo_bis: 3GBYtes [16:15:20] (03PS1) 10Andrew Bogott: Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 [16:15:47] (03CR) 10jerkins-bot: [V: 04-1] Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 (owner: 10Andrew Bogott) [16:17:39] (03PS2) 10Andrew Bogott: Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 [16:18:04] (03CR) 10jerkins-bot: [V: 04-1] Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 (owner: 10Andrew Bogott) [16:19:24] (03PS3) 10Andrew Bogott: Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 [16:36:30] (03CR) 10Andrew Bogott: [C: 032] Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 (owner: 10Andrew Bogott) [16:39:43] (03PS1) 10Andrew Bogott: nova region-migrate: fix erb mistake [puppet] - 10https://gerrit.wikimedia.org/r/443112 [16:40:15] (03CR) 10Andrew Bogott: [C: 032] nova region-migrate: fix erb mistake [puppet] - 10https://gerrit.wikimedia.org/r/443112 (owner: 10Andrew Bogott) [16:43:22] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841 (10herron) For wikitech-l the DMARC moderation policy is accept, which is known to cause issues with ISPs using stri... 
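Editor's note: the two back-of-the-envelope figures quoted during the cr2-eqiad traffic spike (the 15:37:36 hit-rate estimate and the 16:07:09 transfer-volume estimate) can be checked with a small sketch. The function names below are hypothetical and not taken from any tool mentioned in the channel, and the rate unit is an assumption: the log leaves open whether the figure was gigabytes or gigabits per second, so both readings are shown.

```python
def miss_multiplier(hit_before: float, hit_after: float) -> float:
    """Factor by which backend (cache-miss) requests grow when the
    hit rate falls from hit_before to hit_after (both as fractions)."""
    return (1.0 - hit_after) / (1.0 - hit_before)


def volume(rate_per_second: float, minutes: float) -> float:
    """Total volume moved at a sustained rate, in the same unit as the rate."""
    return rate_per_second * minutes * 60


if __name__ == "__main__":
    # 97.5% -> 93.6% hit rate: misses go from 2.5% to 6.4% of requests.
    print(round(miss_multiplier(0.975, 0.936), 2))

    # Reading the rate as 3 GiB/s sustained for 25 minutes (assumption).
    print(volume(3, 25), "GiB")

    # Reading the rate as 3 Gbit/s instead (assumption): divide by 8 for bytes.
    print(volume(3, 25) / 8, "GB")
```

The hit-rate function works on the miss fraction rather than the hit fraction, which is why a drop of under four percentage points still more than doubles the request rate reaching swift.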
[16:52:51] (03PS1) 10Eevans: restbase: cleanup remainging detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) [16:58:25] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) [16:59:20] looks like a big outage with comcast, so not sure if anything to be done [17:00:03] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10ayounsi) a:03ayounsi [17:04:14] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10ayounsi) 1st look seem to indicate an issue between Telia and Comcast or within Comcast. ``` ayounsi@bast1002:~$ mtr 73.37.60.183 -z --report-wide Start: Fri Jun 29... [17:09:15] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) ``` $ sudo mtr bast1002.wikimedia.org -z --report-wide Password: Start: 2018-06-29T10:07:32-0700 HOST: Orac.local L... [17:11:59] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) and in ipv4: ``` $ sudo mtr bast1002.wikimedia.org -z --report-wide -4 Start: 2018-06-29T10:10:31-0700 HOST: Orac.local... [17:17:56] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10ayounsi) Telia's NOC contacted. 
[17:21:24] !log deactivating v6 BGP session to Telia on cr2-eqiad - T198502 [17:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:28] T198502: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 [17:23:49] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) Now getting ``` $ sudo mtr bast1002.wikimedia.org -z --report-wide Password: Start: 2018-06-29T10:22:50-0700 HOST: Orac.local... [17:28:19] !log Re-activating v6 BGP session to Telia on cr2-eqiad - T198502 [17:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:23] T198502: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 [17:32:08] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10bearND) I'm affected, too. Comcast in CO. [17:32:13] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10ayounsi) Traffic takes another path, GTT to us, HE back, but still no luck, so the issue seems to be within Comcast. Looking at some Netops IRC channels, there seem t... [17:36:04] James_F: o/ Is there a task for broken i18n on testwiki? I can't find one. [17:37:50] Niharika: No, just follow-up from breakage two weeks ago. :-( [17:38:16] Okay. Running full scap yesterday did not rebuild pagetriage messages, unfortunately. [17:39:09] Yeah, I think there's something more seriously wrong with the wiki. [17:39:51] Wasn't there some cache refactoring issues? [17:40:00] either MCR or something Aaron was working on [17:40:05] And I thought there was a patch to fix that [17:40:22] Yeah, I pinged Aaron but he said he didn't think it was him. [17:40:29] * James_F digs. 
[17:40:51] https://phabricator.wikimedia.org/T197450 [17:41:24] test2 also has missing messages in Preferences > Gadgets. [17:59:16] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450 (10Reedy) p:05Triage>03High [18:05:02] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10bearND) translatewiki.net was also affected for me. But both TWN and Gerrit are back for me now. [18:10:31] 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) 05Open>03Resolved Seems to have cleared up for me too now. Marking resolved. \o/ [18:11:27] 10Operations, 10Analytics, 10Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (10Nuria) Ya, 8k seems quite a bit, not sure why would we need more than that in either end. 
[18:17:48] 10Operations, 10Research, 10Research-collaborations, 10Research-management, and 2 others: Remove shell access for ironholds on 2018-06-29 - https://phabricator.wikimedia.org/T197895 (10herron) a:03herron [18:18:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211 (10herron) a:03herron [18:18:56] (03PS2) 10Herron: adding amire80 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/442143 (https://phabricator.wikimedia.org/T198211) (owner: 10RobH) [18:21:57] (03CR) 10Herron: [C: 032] adding amire80 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/442143 (https://phabricator.wikimedia.org/T198211) (owner: 10RobH) [18:24:03] (03PS4) 10Herron: remove oliver's access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895) (owner: 10RobH) [18:25:52] (03CR) 10Herron: [C: 032] remove oliver's access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895) (owner: 10RobH) [18:40:17] (03CR) 10Smalyshev: [C: 031] Enable smater logging for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: 10Smalyshev) [18:45:17] 10Operations, 10Research, 10Research-collaborations, 10Research-management, and 2 others: Remove shell access for ironholds on 2018-06-29 - https://phabricator.wikimedia.org/T197895 (10herron) 05Open>03Resolved This has been merged and the access removal is propagating out across the fleet now. ``` No... 
[18:45:20] 10Operations, 10Ops-Access-Reviews, 10Research, 10Research-collaborations, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945 (10herron) [18:48:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211 (10herron) 05Open>03Resolved This has been completed ``` stat1006:~# id amire80 uid=2076(amire80) gid=500(wikidev) groups=500(wikidev),726(statistics-users... [18:50:50] (03CR) 10Eevans: [C: 031] "[PC output](http://puppet-compiler.wmflabs.org/11619) shows that all this does remove some redundant IPs from ferm (it is functionally a n" [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans) [19:03:13] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): User[ironholds] [19:06:48] ^ that seems related to what I recently merged — looking [19:11:33] (03PS1) 10Rush: cloud: labs* VLANs are renamed in the switches [dns] - 10https://gerrit.wikimedia.org/r/443135 [19:11:51] !log stat1005:~# killall -u ironholds T197895 [19:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:57] T197895: Remove shell access for ironholds on 2018-06-29 - https://phabricator.wikimedia.org/T197895 [19:13:23] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:18:40] (03PS2) 10Eevans: restbase: cleanup remainging detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) [19:18:41] (03PS1) 10Eevans: WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) [19:19:27] (03CR) 
10jerkins-bot: [V: 04-1] WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans) [19:19:37] (03PS3) 10Eevans: restbase: cleanup remainging detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) [19:19:39] (03PS2) 10Eevans: WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) [19:20:18] (03CR) 10jerkins-bot: [V: 04-1] WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans) [19:20:50] (03PS4) 10Eevans: restbase: cleanup remaining detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) [19:20:52] (03PS3) 10Eevans: WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) [19:21:23] (03CR) 10jerkins-bot: [V: 04-1] WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans) [19:23:33] (03PS4) 10Eevans: WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) [19:24:12] (03CR) 10jerkins-bot: [V: 04-1] WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans) [19:32:38] (03CR) 10Bmansurov: [C: 04-1] "to be deployed on 7/5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441568 
(https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [19:56:18] (03PS1) 10Andrew Bogott: nova region-migrate: move region-migrate.conf into the envscripts class [puppet] - 10https://gerrit.wikimedia.org/r/443139 [20:33:43] (03CR) 10Andrew Bogott: [C: 032] nova region-migrate: move region-migrate.conf into the envscripts class [puppet] - 10https://gerrit.wikimedia.org/r/443139 (owner: 10Andrew Bogott) [20:36:20] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042 (10Smalyshev) Do we have anything left to do here? [20:39:26] (03PS1) 10Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) [20:44:09] (03PS1) 10Andrew Bogott: region-migrate.conf: remove some injurious quote marks [puppet] - 10https://gerrit.wikimedia.org/r/443191 [20:47:56] (03CR) 10Andrew Bogott: [C: 032] region-migrate.conf: remove some injurious quote marks [puppet] - 10https://gerrit.wikimedia.org/r/443191 (owner: 10Andrew Bogott) [21:09:21] 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10cwdent) [21:20:49] (03PS1) 10Rush: openstack: eqiad1 initial neutron l3-agent manifest [puppet] - 10https://gerrit.wikimedia.org/r/443197 (https://phabricator.wikimedia.org/T196633) [21:23:14] (03PS2) 10Rush: openstack: eqiad1 initial neutron l3-agent manifest [puppet] - 10https://gerrit.wikimedia.org/r/443197 (https://phabricator.wikimedia.org/T196633) [21:27:57] (03PS3) 10Rush: openstack: eqiad1 initial neutron l3-agent manifest [puppet] - 10https://gerrit.wikimedia.org/r/443197 (https://phabricator.wikimedia.org/T196633) [21:32:53] (03CR) 10Rush: [C: 032] openstack: eqiad1 initial neutron l3-agent manifest [puppet] - 
10https://gerrit.wikimedia.org/r/443197 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [21:37:57] !log (late log) of VLAN creations for T184209 cloud-instances2-b-eqiad and cloud-instance-transport1-b-eqiad [21:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:26] MatmaRex: backport just merged, got a few minutes to deploy this with me? [21:40:15] thcipriani: yeah. i've just figured out how to allow myself to rollback changes on test.wiki [21:41:15] awesome, ok, it's live on mwdebug1002 [21:41:45] thank you for your help [21:43:21] thcipriani: i'm testing, the page is taking ages to load. i've had that happen before with mwdebug1002, dunno why [21:43:56] ugh, it crapped out, actually. [Wzan7wpAAC4AAGvGsZAAAAAK] 2018-06-29 21:43:28: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" [21:44:22] but it rollbacked correctly: https://test.wikipedia.org/w/index.php?title=T198449&action=history [21:44:23] T198449: Rollback from autopatrolled user was marked as not patrolled - https://phabricator.wikimedia.org/T198449 [21:44:31] and the change is patrolled on https://test.wikipedia.org/wiki/Special:RecentChanges [21:45:04] thcipriani: can you look up the exception for Wzan7wpAAC4AAGvGsZAAAAAK? i don't see it in logstash [21:45:28] i just got that when making that rollback (which succeeded, otherwise) [21:46:40] * thcipriani looking [21:47:14] > Error: 1205 Lock wait timeout exceeded; [21:47:30] (which is another concerning trend. that's the third time in two days where i can't lookup an exception in logstash) [21:48:45] https://phabricator.wikimedia.org/P7318 [21:52:27] thcipriani: i have no idea if we should proceed. it might be an issue specifically with mwdebug1002; the last time i was testing something on it this week, it took a super long time to give me a response, just like this time [21:52:59] (except this time it exceeded a lock wait timeout because of it) [21:54:01] MatmaRex: me either re:should proceed. 
In light of the fact that it's Friday, I'll roll back and comment on the task. thank you for all your help! If it were a cleaner path forward, it could have saved some folks some headache over the weekend :( [21:54:27] yeah, I'm not comfortable right now. [21:55:38] yeah, i understand. [21:55:41] this is dumb, eh [21:55:48] logstash has some timeouts and slow queries logged when i search for "WikiPage::lockAndGetLatest", by the way [22:23:47] (03CR) 10BryanDavis: [C: 04-1] "I'm not sure that the order of checks here is quite right. The tool could have a ~/.webservicerc file without having set a --backend value" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [22:40:49] 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10ayounsi) I don't either. Note that 8.155.80.208.in-addr.arpa domain name pointer frbast1001.wikimedia.org. [22:44:19] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042 (10Gehel) I think we're all done here. [22:44:55] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Investigate HTTP 500 on POST request to WDQS - https://phabricator.wikimedia.org/T198055 (10Gehel) Looks good, we can close this.
[22:44:59] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042 (10Gehel) [22:45:02] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Investigate HTTP 500 on POST request to WDQS - https://phabricator.wikimedia.org/T198055 (10Gehel) 05Open>03Resolved a:03Gehel [22:58:43] 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10cwdent) @ayounsi ah yes thanks, I forgot to update the documentation for that, but just did