[00:02:12] <wikibugs>	 (PS1) Ayounsi: Reserve IPs for eqdfw GTT transport vlans [dns] - https://gerrit.wikimedia.org/r/443010
[00:02:57] <wikibugs>	 (CR) Ayounsi: [C: 2] Reserve IPs for eqdfw GTT transport vlans [dns] - https://gerrit.wikimedia.org/r/443010 (owner: Ayounsi)
[00:21:39] <icinga-wm>	 RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational
[00:24:59] <icinga-wm>	 PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:31:53] <paladox>	 hmm
[00:32:08] <paladox>	 andrewbogott chasemp ^^
[00:32:13] <bd808>	 I think that's ok for now paladox. labcontrol1004 is not active
[00:32:33] <bd808>	 that's one of the 2 new servers we are in the process of bringing online
[00:32:56] <paladox>	 ah ok
[00:32:58] <andrewbogott>	 I'll see if I can silence it more than it is already
[01:09:24] <wikibugs>	 (CR) Krinkle: [C: 2] Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/432016 (owner: Krinkle)
[01:09:42] <Krinkle>	 RoanKattouw: I assume the deploy is finished?
[01:10:48] <wikibugs>	 (Merged) jenkins-bot: Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/432016 (owner: Krinkle)
[01:11:03] <wikibugs>	 (CR) jenkins-bot: Use perftools/xhgui-collector instead of perftools/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/432016 (owner: Krinkle)
[01:11:23] <RoanKattouw>	 Yes
[01:11:29] <RoanKattouw>	 Oh crap I forgot to run that full scap for Niharika 
[01:11:40] <RoanKattouw>	 CI took so damn long that I forgot about it
[01:12:22] <Krinkle>	 k, well, just give me a minute in that case.
[01:14:07] * Krinkle staging on deploy1001 and mwdebug1002
[01:17:04] <wikibugs>	 (PS1) Krinkle: profiler: Put 'global' keyword back [mediawiki-config] - https://gerrit.wikimedia.org/r/443016
[01:17:17] <wikibugs>	 (CR) Krinkle: [C: 2] profiler: Put 'global' keyword back [mediawiki-config] - https://gerrit.wikimedia.org/r/443016 (owner: Krinkle)
[01:18:32] <wikibugs>	 (Merged) jenkins-bot: profiler: Put 'global' keyword back [mediawiki-config] - https://gerrit.wikimedia.org/r/443016 (owner: Krinkle)
[01:19:22] <logmsgbot>	 !log krinkle@deploy1001 Synchronized vendor/: I39e592d837 - remove redundant composer packages for xhgui profiling (duration: 00m 52s)
[01:19:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:43] <Krinkle>	 RoanKattouw: OK.. You can do your scap now :)
[01:19:49] <RoanKattouw>	 OK will do
[01:20:21] <logmsgbot>	 !log catrope@deploy1001 Started scap: Full scap to fix i18n issues on testwiki
[01:20:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:02] <wikibugs>	 (CR) Krinkle: "Checked-picked to beta's puppetmaster since 2-3 weeks. Fixed beta's deployment host." [puppet] - https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314) (owner: Krinkle)
[01:53:04] <logmsgbot>	 !log catrope@deploy1001 Finished scap: Full scap to fix i18n issues on testwiki (duration: 32m 42s)
[01:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:41] <wikibugs>	 (CR) jenkins-bot: profiler: Put 'global' keyword back [mediawiki-config] - https://gerrit.wikimedia.org/r/443016 (owner: Krinkle)
[02:21:28] <icinga-wm>	 RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational
[03:27:59] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.74 seconds
[03:31:19] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 252.37 seconds
[04:03:32] <Niharika>	 Thanks RoanKattouw!
[04:41:35] <wikibugs>	 (CR) Smalyshev: Enable smater logging for wdqs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[04:48:55] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/442891
[04:50:15] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/442891 (owner: Marostegui)
[04:51:27] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/442891 (owner: Marostegui)
[04:51:44] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool es1015" [mediawiki-config] - https://gerrit.wikimedia.org/r/442891 (owner: Marostegui)
[04:51:59] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021
[04:52:03] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021
[04:52:36] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1015 (duration: 00m 52s)
[04:52:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:18] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021 (owner: Marostegui)
[04:54:29] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021 (owner: Marostegui)
[04:55:34] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 after alter table (duration: 00m 50s)
[04:55:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:56] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/443022 (https://phabricator.wikimedia.org/T191316)
[04:57:15] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/443021 (owner: Marostegui)
[04:59:25] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/443022 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:00:04] <marostegui>	 !log /script unload irssinotifier
[05:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:15] <marostegui>	 gah!
[05:00:17] <marostegui>	 :)
[05:00:33] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/443022 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:01:43] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 for alter table (duration: 00m 50s)
[05:01:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:02] <marostegui>	 !log Deploy schema change on db1104 T191316 T192926 T89737 T195193
[05:02:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:08] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:02:09] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:02:09] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:02:09] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:04:01] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/443022 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:04:09] <wikibugs>	 Operations, ops-eqiad, Analytics, User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707 (Marostegui) Still on-going ``` root@dbstore1002:~# megacli -PDRbld -ShowProg -PhysDrv [32:5] -aALL  Rebuild Progress on Device at Enclosure 32, Slot 5 Completed 44% in 1019 Min...
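The megacli progress figure quoted above (44% completed in 1019 minutes) allows a rough estimate of the remaining rebuild time. The following is a minimal sketch, assuming the rebuild proceeds at a constant rate, which a real RAID rebuild under load may not do; the function name `remaining_minutes` is hypothetical and not part of megacli.

```python
def remaining_minutes(percent_done: float, elapsed_min: float) -> float:
    """Linearly extrapolate the remaining time from progress so far.

    Assumption: the rebuild rate stays constant for the rest of the run.
    """
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    # Projected total duration if the observed rate holds.
    total_min = elapsed_min * 100.0 / percent_done
    return total_min - elapsed_min


# Figures taken from the log line: 44% done after 1019 minutes.
print(round(remaining_minutes(44, 1019)))  # prints 1297
```

Under that assumption the rebuild would need roughly another 1297 minutes, a bit under 22 hours.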
[05:10:22] <wikibugs>	 Operations, monitoring: dbstore1001 backup jobs failed between 2016-10-19 and 2016-11-23 - https://phabricator.wikimedia.org/T151579 (Marostegui)
[05:11:23] <wikibugs>	 Operations, DBA, monitoring, Patch-For-Review: Create script to monitor db dumps for backups are successful - https://phabricator.wikimedia.org/T151999 (Marostegui) Open>Resolved a:jcrespo I am going to close this as resolved as per T151999#4038297 and going to create a task for the "...
[05:40:14] <elukey>	 !log force umount of dumps labstore nfs mountpoints on stat100[56]/notebook100[34] to reduce load (also too many open files)
[05:40:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:32] <wikibugs>	 Operations, ops-eqiad, Cloud-VPS, Datasets-General-or-Unknown: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (elukey) As a follow up, I had to force umount the labstore1006's nfs mountpoint on stat100[5,6] and notebook100[3,4] since the...
[06:05:56] <wikibugs>	 (PS2) Jcrespo: mariadb: Prepare for reimage of es1016 to stretch [puppet] - https://gerrit.wikimedia.org/r/442795
[06:07:12] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Prepare for reimage of es1016 to stretch [puppet] - https://gerrit.wikimedia.org/r/442795 (owner: Jcrespo)
[06:21:36] <jynus>	 !log stop and reimage es1016
[06:21:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:28] <icinga-wm>	 PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:00:44] <icinga-wm>	 RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:05:48] <wikibugs>	 (PS1) Jcrespo: mariadb: Reenable notifications on es1016 after reimage [puppet] - https://gerrit.wikimedia.org/r/443028
[07:07:25] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw2289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:08:15] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw2289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.189 second response time
[07:09:05] <wikibugs>	 (PS1) ArielGlenn: add labstore106 back to hosts that get rsync from stats servers [puppet] - https://gerrit.wikimedia.org/r/443029
[07:11:14] <wikibugs>	 (PS2) ArielGlenn: add labstore1006 back to hosts that get rsync from stats servers [puppet] - https://gerrit.wikimedia.org/r/443029
[07:11:46] <wikibugs>	 (PS2) Muehlenhoff: Move php5 packages to contint class [puppet] - https://gerrit.wikimedia.org/r/438164
[07:13:24] <wikibugs>	 (CR) ArielGlenn: [C: 2] add labstore1006 back to hosts that get rsync from stats servers [puppet] - https://gerrit.wikimedia.org/r/443029 (owner: ArielGlenn)
[07:13:34] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Reenable notifications on es1016 after reimage [puppet] - https://gerrit.wikimedia.org/r/443028 (owner: Jcrespo)
[07:13:43] <wikibugs>	 (PS2) Jcrespo: mariadb: Reenable notifications on es1016 after reimage [puppet] - https://gerrit.wikimedia.org/r/443028
[07:17:42] <wikibugs>	 (CR) Muehlenhoff: [C: 1] "Looks good!" [software/debmonitor] - https://gerrit.wikimedia.org/r/442876 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[07:41:12] <wikibugs>	 (PS1) Marostegui: install_server: Allow reimage db2054 [puppet] - https://gerrit.wikimedia.org/r/443032
[07:41:38] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033
[07:43:18] <wikibugs>	 (PS2) Jcrespo: Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033
[07:43:22] <wikibugs>	 (PS2) Marostegui: install_server: Allow reimage db2054 [puppet] - https://gerrit.wikimedia.org/r/443032
[07:45:06] <wikibugs>	 (CR) Marostegui: [C: 2] install_server: Allow reimage db2054 [puppet] - https://gerrit.wikimedia.org/r/443032 (owner: Marostegui)
[07:46:42] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Depool db2054 [mediawiki-config] - https://gerrit.wikimedia.org/r/443035
[07:47:37] <wikibugs>	 (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033 (owner: Jcrespo)
[07:48:55] <wikibugs>	 (Merged) jenkins-bot: Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033 (owner: Jcrespo)
[07:49:49] <wikibugs>	 (CR) jenkins-bot: Revert "mariadb: Depool es1016 for reimage to stretch" [mediawiki-config] - https://gerrit.wikimedia.org/r/443033 (owner: Jcrespo)
[07:50:39] <wikibugs>	 (PS1) Muehlenhoff: Enable microcode updates for all elasticsearch hosts [puppet] - https://gerrit.wikimedia.org/r/443038 (https://phabricator.wikimedia.org/T127825)
[07:51:09] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool db2054 [mediawiki-config] - https://gerrit.wikimedia.org/r/443035 (owner: Marostegui)
[07:52:31] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Depool db2054 [mediawiki-config] - https://gerrit.wikimedia.org/r/443035 (owner: Marostegui)
[07:53:46] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2054 for reimage (duration: 00m 50s)
[07:53:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:50] <marostegui>	 !log Stop MySQL on db2054 for reimage 
[07:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:11] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Depool db2054 [mediawiki-config] - https://gerrit.wikimedia.org/r/443035 (owner: Marostegui)
[07:56:14] <wikibugs>	 Operations, netops: Allow labnet/labnodepool/labvirt to connect to debmonitor hosts/443 - https://phabricator.wikimedia.org/T198375 (MoritzMuehlenhoff) Thanks, confirmed working fine. All missing hosts were able to ingest their package data and servermon and debdeploy are now tracking the same number of...
[07:57:23] <jynus>	 marostegui: are you deploying?
[07:57:30] <marostegui>	 I just did
[07:57:49] <jynus>	 not a problem, but you pulled also my change
[07:58:06] <jynus>	 only mentioning it because on other case it could be a problem
[07:58:30] <marostegui>	 yeah, I saw your change, but I thought you already deployed it
[07:58:38] <jynus>	 not yet
[07:58:52] <jynus>	 I was waiting for the +2
[07:59:37] <marostegui>	 but it arrived at 09:48
[07:59:39] <marostegui>	 no?
[08:00:40] <jynus>	 I am just saying check how many changes you pull, sometimes you find random things
[08:00:49] <jynus>	 even non-db stuff
[08:01:02] <marostegui>	 yeah
[08:01:27] <jynus>	 so what did you deploy, only codfw?
[08:01:35] <marostegui>	 yes
[08:01:46] <jynus>	 then this is strange
[08:01:52] <jynus>	 because I saw my change deployed
[08:01:58] <jynus>	 which is only eqiad
[08:02:09] <marostegui>	 07:53:46 Synchronized wmf-config/db-codfw.php: Depool db2054 for reimage (duration: 00m 50s)
[08:02:12] <jynus>	 oh, it wasn't
[08:02:23] <jynus>	 I was looking at the wrong place, I will do it now
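jynus's advice above, to check which changes a deploy pulls in before syncing, can be sketched as a small filter. This is a hypothetical helper, not a scap feature: it assumes the pending commits and the files they touch have already been collected by some other means (for example from the deploy host's git history), and the change numbers and paths below are simply the ones quoted in this log.

```python
def unexpected_changes(pending, intended_paths):
    """Return pending changes that touch files outside the intended set.

    pending: iterable of (change_id, [paths]) tuples, assumed to be
             gathered beforehand; this function does not talk to git.
    intended_paths: the files the deployer means to sync.
    """
    intended = set(intended_paths)
    flagged = []
    for change_id, paths in pending:
        extra = sorted(set(paths) - intended)
        if extra:
            flagged.append((change_id, extra))
    return flagged


# Illustration with the two changes discussed in the conversation.
pending = [
    ("443035", ["wmf-config/db-codfw.php"]),  # Depool db2054
    ("443033", ["wmf-config/db-eqiad.php"]),  # Repool es1016
]
for change_id, extra in unexpected_changes(pending, ["wmf-config/db-codfw.php"]):
    print(f"change {change_id} also touches {', '.join(extra)}")
```

With the intended sync limited to `wmf-config/db-codfw.php`, only change 443033 is flagged, which matches the situation jynus describes.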
[08:03:48] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1016 (duration: 00m 50s)
[08:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:48] <wikibugs>	 Operations, Analytics, Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (ema) p:Triage>Normal
[08:23:43] <wikibugs>	 (PS1) Vgutierrez: varnishkafka: Set TLS curve list and sigalgs list defaults [puppet] - https://gerrit.wikimedia.org/r/443043 (https://phabricator.wikimedia.org/T182993)
[08:24:17] <wikibugs>	 (CR) Elukey: [C: 1] varnishkafka: Set TLS curve list and sigalgs list defaults [puppet] - https://gerrit.wikimedia.org/r/443043 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:27:47] <akosiaris>	 !log upgrade php on phab1001 to 5.6.33+dfsg-0+deb8u1
[08:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:30] <akosiaris>	 moritzm: fyi ^
[08:35:49] <wikibugs>	 (CR) Vgutierrez: [C: 2] "pcc looks happy and shows noop in upload and misc and the expected changes in text: https://puppet-compiler.wmflabs.org/compiler02/11613/" [puppet] - https://gerrit.wikimedia.org/r/443043 (https://phabricator.wikimedia.org/T182993) (owner: Vgutierrez)
[08:39:23] <vgutierrez>	 !log Apply new TLS varnishkafka settings in cache::text nodes - T182993
[08:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:26] <stashbot>	 T182993: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993
[08:39:59] <wikibugs>	 (PS1) Alexandros Kosiaris: phabricator: Use the mysql native driver [puppet] - https://gerrit.wikimedia.org/r/443045
[08:46:24] <wikibugs>	 (CR) Gehel: [C: 1] "LGTM, no problem identified on the test hosts." [puppet] - https://gerrit.wikimedia.org/r/443038 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[08:49:45] <wikibugs>	 Operations, Mail, Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841 (Aklapper) >>! In T146841#3729163, @Dzahn wrote: > @Seb35 @Peachey88  @Herron since T168467 is resolved meanwhile,...
[08:50:00] <wikibugs>	 (PS1) Marostegui: mariadb: Enable notifications on db2054 [puppet] - https://gerrit.wikimedia.org/r/443046
[08:51:08] <wikibugs>	 Operations, Analytics, Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (ema) Both [[ https://varnish-cache.org/docs/5.1/reference/varnishd.html#http-req-hdr-len | varnish ]] and [[http://nginx.org/en/docs/http/ngx_http_core_module.html#large_client_header_bu...
[08:52:59] <wikibugs>	 (CR) Marostegui: [C: 2] mariadb: Enable notifications on db2054 [puppet] - https://gerrit.wikimedia.org/r/443046 (owner: Marostegui)
[08:57:40] <wikibugs>	 (PS1) Marostegui: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - https://gerrit.wikimedia.org/r/443048
[09:02:02] <wikibugs>	 (PS1) Muehlenhoff: Add component/vp9 [puppet] - https://gerrit.wikimedia.org/r/443052 (https://phabricator.wikimedia.org/T190333)
[09:05:16] <wikibugs>	 (PS2) Muehlenhoff: Add component/vp9 [puppet] - https://gerrit.wikimedia.org/r/443052 (https://phabricator.wikimedia.org/T190333)
[09:05:18] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2054" [mediawiki-config] - https://gerrit.wikimedia.org/r/443048 (owner: Marostegui)
[09:06:29] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Add component/vp9 [puppet] - https://gerrit.wikimedia.org/r/443052 (https://phabricator.wikimedia.org/T190333) (owner: Muehlenhoff)
[09:06:40] <wikibugs>	 (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - https://gerrit.wikimedia.org/r/443048 (owner: Marostegui)
[09:08:33] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2054 after reimage (duration: 00m 51s)
[09:08:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:35] <wikibugs>	 (CR) jenkins-bot: Revert "db-codfw.php: Depool db2054" [mediawiki-config] - https://gerrit.wikimedia.org/r/443048 (owner: Marostegui)
[09:16:56] <wikibugs>	 (PS1) Marostegui: install_server: Allow reimage db2058 [puppet] - https://gerrit.wikimedia.org/r/443054
[09:18:34] <moritzm>	 !log uploaded libvpx 1.7.0-3+wmf1 to apt.wikimedia.org/stretch-wikimedia (component/vp9) (T190333)
[09:18:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:38] <stashbot>	 T190333: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333
[09:18:41] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Depool db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/443055
[09:18:51] <wikibugs>	 (PS2) Marostegui: install_server: Allow reimage db2058 [puppet] - https://gerrit.wikimedia.org/r/443054
[09:19:41] <wikibugs>	 (CR) Marostegui: [C: 2] install_server: Allow reimage db2058 [puppet] - https://gerrit.wikimedia.org/r/443054 (owner: Marostegui)
[09:20:38] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/443055 (owner: Marostegui)
[09:22:07] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Depool db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/443055 (owner: Marostegui)
[09:23:14] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2058 for reimage (duration: 00m 50s)
[09:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:17] <marostegui>	 !log Stop MySQL on db2058 to reimage it
[09:23:18] <wikibugs>	 (CR) Gehel: Enable smater logging for wdqs (3 comments) [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:50] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 1] Improve validation on host package updates [software/debmonitor] - https://gerrit.wikimedia.org/r/442876 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[09:25:45] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Depool db2058 [mediawiki-config] - https://gerrit.wikimedia.org/r/443055 (owner: Marostegui)
[09:26:30] <wikibugs>	 Operations, ops-eqiad, Cloud-VPS, Datasets-General-or-Unknown: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (ArielGlenn) Fsck on labstore1006 completed last night, it did not take long.         labstore1006 after 'repair' of logical dri...
[09:26:47] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] manage.py: add custom command for GC (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/442901 (owner: Volans)
[09:35:25] <wikibugs>	 (PS2) Volans: Add custom Django management command [software/debmonitor] - https://gerrit.wikimedia.org/r/442901
[09:35:41] <wikibugs>	 (PS3) Gehel: Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:36:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:36:15] <wikibugs>	 (CR) Gehel: Enable smater logging for wdqs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:36:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add custom Django management command [software/debmonitor] - https://gerrit.wikimedia.org/r/442901 (owner: Volans)
[09:36:37] <wikibugs>	 Operations, Electron-PDFs, Proton, Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (mobrovac)
[09:39:26] <wikibugs>	 (CR) Volans: "Addressed comment, thanks for the review!" (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/442901 (owner: Volans)
[09:41:09] <icinga-wm>	 PROBLEM - Disk space on labstore1006 is CRITICAL: DISK CRITICAL - free space: / 33468 MB (3% inode=99%)
[09:42:44] <wikibugs>	 (PS4) Gehel: Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:43:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:44:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 1] Add custom Django management command [software/debmonitor] - https://gerrit.wikimedia.org/r/442901 (owner: Volans)
[09:45:34] <wikibugs>	 Operations, Discovery, Discovery-Search: migrate elasticsearch cirrus cluster to RAID0 - https://phabricator.wikimedia.org/T198391 (Gehel) About RAID10 / RAID5(0), in both cases it would require adding more disk to those servers, which is something we are trying to avoid. The JBOD approach could work...
[09:46:41] <wikibugs>	 (CR) Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/11616/" [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:48:14] <wikibugs>	 (PS5) Gehel: Enable smater logging for wdqs [puppet] - https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: Smalyshev)
[09:56:38] <wikibugs>	 (PS1) Marostegui: Revert "db-codfw.php: Depool db2058" [mediawiki-config] - https://gerrit.wikimedia.org/r/443061
[09:56:48] <wikibugs>	 (PS1) Marostegui: Revert "install_server: Allow reimage db2058" [puppet] - https://gerrit.wikimedia.org/r/443062
[10:00:27] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "install_server: Allow reimage db2058" [puppet] - https://gerrit.wikimedia.org/r/443062 (owner: Marostegui)
[10:10:47] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065
[10:10:53] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065
[10:12:08] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065 (owner: Marostegui)
[10:13:27] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065 (owner: Marostegui)
[10:13:44] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - https://gerrit.wikimedia.org/r/443065 (owner: Marostegui)
[10:14:36] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1104 after alter table (duration: 00m 50s)
[10:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:00] <wikibugs>	 (PS2) Muehlenhoff: Enable microcode updates for all elasticsearch hosts [puppet] - https://gerrit.wikimedia.org/r/443038 (https://phabricator.wikimedia.org/T127825)
[10:15:17] <wikibugs>	 (CR) Gehel: [C: 1] "LGTM - I'd love to have Giuseppe input on this, but this is already enough of an improvement that we could go forward as-is." [puppet] - https://gerrit.wikimedia.org/r/440498 (owner: EBernhardson)
[10:15:21] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/443067 (https://phabricator.wikimedia.org/T191316)
[10:15:50] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Enable microcode updates for all elasticsearch hosts [puppet] - https://gerrit.wikimedia.org/r/443038 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[10:17:04] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/443067 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:18:06] <logmsgbot>	 !log mobrovac@deploy1001 Started deploy [citoid/deploy@40cdff7]: Update citoid to fd77117 - T165105 T197853
[10:18:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:10] <stashbot>	 T165105: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105
[10:18:11] <stashbot>	 T197853: Trailing single quote in DOIs breaks citoid - https://phabricator.wikimedia.org/T197853
[10:18:25] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/443067 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:19:39] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1092 for alter table (duration: 00m 50s)
[10:19:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:01] <marostegui>	 !log Deploy schema change on db1092 T191316 T192926 T89737 T195193
[10:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:07] <stashbot>	 T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[10:20:08] <stashbot>	 T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[10:20:08] <stashbot>	 T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[10:20:08] <stashbot>	 T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[10:21:07] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - https://gerrit.wikimedia.org/r/443067 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[10:21:32] <logmsgbot>	 !log mobrovac@deploy1001 Finished deploy [citoid/deploy@40cdff7]: Update citoid to fd77117 - T165105 T197853 (duration: 03m 26s)
[10:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:36] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2058" [mediawiki-config] - https://gerrit.wikimedia.org/r/443061 (owner: Marostegui)
[10:33:52] <wikibugs>	 (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2058" [mediawiki-config] - https://gerrit.wikimedia.org/r/443061 (owner: Marostegui)
[10:34:56] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2058 after reimage (duration: 00m 51s)
[10:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:19] <wikibugs>	 (CR) jenkins-bot: Revert "db-codfw.php: Depool db2058" [mediawiki-config] - https://gerrit.wikimedia.org/r/443061 (owner: Marostegui)
[11:01:56] <moritzm>	 !log installing lame security updates
[11:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:21] <wikibugs>	 Operations, Operations-Software-Development, Phabricator, Technical-Debt: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045 (Aklapper)
[11:05:09] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for lame [puppet] - https://gerrit.wikimedia.org/r/443076
[11:05:58] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Add library hint for lame [puppet] - https://gerrit.wikimedia.org/r/443076 (owner: Muehlenhoff)
[11:11:11] <moritzm>	 !log installing zsh updates from jessie 8.11 point release
[11:11:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:25] <wikibugs>	 (03CR) 10Ladsgroup: "This can be deployed now. Probably on Monday I guess but it's in prod now." [puppet] - 10https://gerrit.wikimedia.org/r/440986 (https://phabricator.wikimedia.org/T147169) (owner: 10Ladsgroup)
[11:21:53] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix Cumin alias for kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/443077
[11:44:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for xerces-c [puppet] - 10https://gerrit.wikimedia.org/r/443078
[11:44:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add library hint for xerces-c [puppet] - 10https://gerrit.wikimedia.org/r/443078 (owner: 10Muehlenhoff)
[11:49:57] <icinga-wm>	 PROBLEM - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:03:27] <icinga-wm>	 RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational
[12:30:55] <moritzm>	 !log uploaded ffmpeg 3.2.10-1~deb9u1+wmf1 to apt.wikimedia.org/stretch-wikimedia (component/vp9) (linked against libvpx 1.7 and with backported row-mt support) (T190333)
[12:30:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:58] <stashbot>	 T190333: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333
[12:50:06] <icinga-wm>	 PROBLEM - Check systemd state on labstore1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:54:56] <wikibugs>	 (03CR) 10Paladox: "We are planning to reinstall with stretch (by switching to phab1002 and then back to phab1001)" [puppet] - 10https://gerrit.wikimedia.org/r/443045 (owner: 10Alexandros Kosiaris)
[12:55:05] <wikibugs>	 (03CR) 10Paladox: [C: 031] phabricator: Use the mysql native driver [puppet] - 10https://gerrit.wikimedia.org/r/443045 (owner: 10Alexandros Kosiaris)
[12:59:51] <wikibugs>	 (03PS2) 10Elukey: Fix Cumin alias for kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/443077 (owner: 10Muehlenhoff)
[13:00:42] <wikibugs>	 (03CR) 10Elukey: [C: 032] Fix Cumin alias for kafka/analytics [puppet] - 10https://gerrit.wikimedia.org/r/443077 (owner: 10Muehlenhoff)
[13:01:36] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492 (10elukey)
[13:03:09] <wikibugs>	 (03PS1) 10Elukey: icing::monitor::analytics: move per host vk alarms to aggregates [puppet] - 10https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492)
[13:03:13] <chasemp>	 ^ apergos do we know why this is flapping?
[13:03:33] <apergos>	 no
[13:03:52] <apergos>	 ● systemd-timedated.service loaded failed failed Time & Date Service
[13:04:05] <apergos>	 sorry, I was / am adding to the incident report
[13:04:07] <apergos>	 from yesterday
[13:12:15] <apergos>	 /dev/dm-0               916G  916G     0 100% /
[13:12:17] <apergos>	 chasemp: 
[13:12:46] <chasemp>	 interesting
[13:15:56] <icinga-wm>	 RECOVERY - Disk space on labstore1006 is OK: DISK OK
[13:15:58] <apergos>	 something wrote to /srv/dumps while the filesystem was unmounted
[13:16:35] <apergos>	 I unmounted, cleared out the crap, remounted just now
[13:17:57] <icinga-wm>	 RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational
[13:19:37] <apergos>	 and that's done
[13:21:24] <wikibugs>	 (03CR) 10Ottomata: "+1 to the general idea, but I don't think this should be in the 'analytics' profile.  With the exception of maybe eventlogging, none of th" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492) (owner: 10Elukey)
[13:22:27] <wikibugs>	 (03CR) 10Ottomata: "Q Tho: Won't this mean that if a single vk host starts dropping messages, we won't get alerts?" [puppet] - 10https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492) (owner: 10Elukey)
[13:31:10] <wikibugs>	 (03CR) 10Elukey: "> +1 to the general idea, but I don't think this should be in the" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492) (owner: 10Elukey)
[13:39:37] <wikibugs>	 (03PS2) 10Elukey: profile::cache::kafka::alerts: move per host vk alarms to aggregates [puppet] - 10https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492)
[13:52:10] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Andrew)
[13:52:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Andrew) p:05Triage>03High
[14:12:04] <librenms-wmf>	 04Critical Alert for device asw-a-eqiad.mgmt.eqiad.wmnet - Critical syslog messages
[14:13:04] <librenms-wmf>	 04Critical Alert for device asw-b-eqiad.mgmt.eqiad.wmnet - Critical syslog messages
[14:13:13] <marostegui>	 XioNoX: ^
[14:13:25] <XioNoX>	 thx, looking
[14:13:47] <XioNoX>	 "   #1: msg => fatal: no kex alg; "
[14:13:56] <wikibugs>	 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Add restrictive CSP to upload.wikimedia.org - https://phabricator.wikimedia.org/T117618 (10Aklapper)
[14:15:14] <jynus>	 it is not mgmt only
[14:15:36] <XioNoX>	 chasemp: did you try to ssh to asw-a/b using weird ssh settings?
[14:15:54] <jynus>	 oh ignore me, unrelated bgp issue
[14:16:13] <XioNoX>	 jynus: bgp issue?
[14:16:20] <chasemp>	 XioNoX: yes, looking to clean up the cloud-hosts-d vlan to co-opt that 1120 vlan-id as discussed 
[14:16:33] <jynus>	 not bgp issue
[14:16:49] <jynus>	 that is how it is named on icinga
[14:16:56] <jynus>	 ignore me
[14:17:03] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-a-eqiad.mgmt.eqiad.wmnet recovered from Critical syslog messages
[14:17:14] <XioNoX>	 chasemp: ok, wanted to understand that "critical" syslog message
[14:17:20] <chasemp>	 asw-a can't handle my diffie-hellman-group-exchange-sha256 bad self
[14:17:55] <chasemp>	 XioNoX: ack, I had pinned settings for asw2* but same negotiation won't work on old devices, sorry for the noise man
[14:17:58] <XioNoX>	 no idea why juniper classifies it as "critical" anyway
[14:18:04] <chasemp>	 that is...weird
[14:18:11] <XioNoX>	 hehe
[14:19:11] <chasemp>	 probably views it as probing
[14:22:03] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-b-eqiad.mgmt.eqiad.wmnet recovered from Critical syslog messages
[14:25:15] <wikibugs>	 10Operations, 10TimedMediaHandler-Transcode, 10Patch-For-Review: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333 (10MoritzMuehlenhoff) I've backported the row-mt patch to ffmpeg 3.2. This allows us to stick with the ffmpeg security updates for Debian...
[14:27:51] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443092
[14:29:01] <wikibugs>	 10Operations, 10Kubernetes, 10Security-Other: Network segmentation for WMF servers - https://phabricator.wikimedia.org/T101912 (10Aklapper)
[14:40:40] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443092 (owner: 10Marostegui)
[14:41:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443092 (owner: 10Marostegui)
[14:42:16] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443092 (owner: 10Marostegui)
[14:43:04] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1092 after alter table (duration: 00m 50s)
[14:43:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:18] <wikibugs>	 (03CR) 10Ottomata: [C: 031] "profile::cache::kafka::alerts, I like :)" [puppet] - 10https://gerrit.wikimedia.org/r/443086 (https://phabricator.wikimedia.org/T173492) (owner: 10Elukey)
[14:46:03] <wikibugs>	 (03Abandoned) 10Gehel: elasticsearch: raise alerting limit for free disk space [puppet] - 10https://gerrit.wikimedia.org/r/430066 (https://phabricator.wikimedia.org/T192972) (owner: 10Gehel)
[14:54:07] <icinga-wm>	 PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,image_status,podsandbox_status,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[14:55:17] <icinga-wm>	 RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:20:34] <wikibugs>	 (03PS1) 10Papaul: DHCP: Add MAC address entries for graphite2003 [puppet] - 10https://gerrit.wikimedia.org/r/443100 (https://phabricator.wikimedia.org/T196483)
[15:20:39] <wikibugs>	 10Operations, 10Pywikibot-General, 10Pywikibot-core, 10Wikimedia-Mailing-lists: recent e-mails missing from pywikibot archive (due to wrong file system permissions) - https://phabricator.wikimedia.org/T107769 (10Aklapper)
[15:23:05] <librenms-wmf>	 04Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%
[15:23:34] <elukey>	 XioNoX: --^ ?
[15:23:51] <XioNoX>	 thx
[15:24:49] <XioNoX>	 come on, fridays are supposed to be quiet
[15:25:09] <XioNoX>	 Hello, traffic spike
[15:25:10] <XioNoX>	 https://librenms.wikimedia.org/graphs/to=1530285600/id=11600/type=port_bits/from=1530199200/
[15:26:26] <XioNoX>	 well, our ashburn IXP port is saturating
[15:26:42] <elukey>	 for traffic originating from us, or towards us
[15:26:43] <elukey>	 ?
[15:26:57] <XioNoX>	 outbound traffic
[15:27:09] <XioNoX>	 tiny amount of inbound traffic, huge outbound
[15:28:01] <XioNoX>	 bblack, vgutierrez, ema, might need some help figuring out what's going on
[15:28:28] <jynus>	 I can see the change in traffic, but not in number of requests
[15:28:33] <XioNoX>	 in ~5/10min the equinix IXP dashboard should tell us toward whom that traffic is
[15:31:07] <XioNoX>	 there is no drop from any other providers, so it's not just traffic re-routing there
[15:31:08] <jynus>	 I think it is upload
[15:31:30] <jynus>	 I also see commons slightly high load
[15:31:39] <bblack>	 usually upload is somehow related, whenever it's more about bytes than reqs
[15:32:02] <bblack>	 I know there were also some changes related to Commons yesterday that were rolled out and then reverted, and I didn't follow closely
[15:32:09] <bblack>	 there were related complaints about commons images, etc
[15:32:13] <jynus>	 swift: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1
[15:32:14] <bblack>	 maybe somewhere inter-related with this
[15:32:34] <jynus>	 cache upload: https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_upload&var-instance=All&from=1530275542254&to=1530286342254
[15:33:28] <bblack>	 https://grafana.wikimedia.org/dashboard/db/varnish-caching?refresh=15m&orgId=1&from=now-3h&to=now&var-cluster=cache_upload&var-site=All&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5
[15:33:42] <bblack>	 ^ upload hitrate dropped by a full percent as well (which is a ton more misses to the backends)
[15:34:05] <XioNoX>	 not sure if related but I see a big spike of traffic for, e.g., ms-be https://librenms.wikimedia.org/graphs/to=1530286200/id=2303/type=port_bits/from=1530199800/
[15:34:31] <bblack>	 the slope of the hitrate dropoff was over the period ~15:09->15:14
[15:34:33] <jynus>	 ms-be is swift, yes, related
[15:34:34] <mark>	 that would almost certainly be related :)
[15:35:12] <bblack>	 so, at the very least there's a big pattern shift that's causing more outbound traffic and more cache misses, which will indirectly cause more network on cache_upload+ms_be, and more load on ms_be in general
[15:35:52] <XioNoX>	 I can't disable that port otherwise it will most likely saturate a transit link instead
[15:35:56] <bblack>	 as best I can tell it's only the eqiad cache frontends seeing this
[15:36:05] <mark>	 let's convert this into a cookbook incident scenario next week :)
[15:36:10] <bblack>	 so it's eqiad-specific, which likely means it's specific to a certain client network
[15:37:00] <XioNoX>	 still waiting on the equinix IXP dashboard to update and show where traffic is going
[15:37:36] <bblack>	 (limiting the hitrate impact graph to eqiad, the dropoff is more like 97.5% -> 93.6%, which means the backend reqrate to swift there has multiplied by ~2.5x.
[15:37:39] <bblack>	 )
[15:37:56] <bblack>	 (for eqiad's reqs, which are a fraction of overall swift reqs, but still)
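Editorial aside: the ~2.5x figure follows from the miss rate rather than the hit rate. A minimal sketch of that arithmetic, assuming backend requests scale with cache misses and the overall request rate stays constant (the function name is illustrative, not part of any tooling mentioned in the channel):

```python
def miss_multiplier(hit_before: float, hit_after: float) -> float:
    """Return how much backend (miss) traffic grows when the hit rate drops.

    Assumes the total request rate is unchanged, so backend requests are
    proportional to the miss rate (1 - hit rate).
    """
    miss_before = 1.0 - hit_before
    miss_after = 1.0 - hit_after
    return miss_after / miss_before


# Hit rates quoted in the channel for eqiad: 97.5% before, 93.6% after.
mult = miss_multiplier(0.975, 0.936)
print(f"backend request multiplier: {mult:.2f}x")
```

A drop of about four percentage points in hit rate looks small, but it takes the miss rate from 2.5% to 6.4%, which is why the load on the backends more than doubles.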
[15:38:08] <jynus>	 it is back to normal now
[15:38:21] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Bstorm) Should be.  It's an HP Smart P420i in a RAID 10 logical disk and is the only failure.  Unless the disk itself isn't a hot swap form factor, it should be good, right?...
[15:38:49] <jynus>	 1.1GB for all upload caches
[15:40:09] <bblack>	 I'm digging through oxygen sampled-1000 for IP/network stuff, should have something shortly
[15:41:01] <bblack>	 seems at least somewhat distributed
[15:41:40] <jynus>	 I will check performance impact
[15:41:41] <mark>	 in terms of destinations or URIs?
[15:41:50] <bblack>	 client IPs I mean
[15:42:02] <bblack>	 although now that I'm filtering down to /24 before aggregating, it's looking like Google
[15:42:53] <bblack>	 googlebot is high in the list of client networks for 15:00 -> 15:39 anyways, but that may be normal heh :)
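Editorial aside: grouping client IPs by /24 before counting can be sketched as below. This is only an illustration using Python's standard library; the sample addresses are documentation ranges, and the actual sampled-log format and tooling used on oxygen are not shown in the log, so nothing here should be read as the real pipeline.

```python
from collections import Counter
import ipaddress


def top_networks(client_ips, prefix_len=24, n=3):
    """Count requests per enclosing network and return the n busiest.

    client_ips: iterable of IPv4 address strings (hypothetical input).
    """
    counts = Counter()
    for ip in client_ips:
        # strict=False masks the host bits, e.g. 192.0.2.77/24 -> 192.0.2.0/24
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts.most_common(n)


# Made-up sample drawn from the RFC 5737 documentation ranges.
sample = [
    "192.0.2.10", "192.0.2.77", "198.51.100.5",
    "192.0.2.200", "203.0.113.9", "198.51.100.6",
]
top = top_networks(sample)
for network, hits in top:
    print(network, hits)
```

Counting requests this way can still miss a client that makes few but very large requests; weighting each entry by response bytes instead of 1 would surface that case.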
[15:43:05] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%
[15:44:03] <chasemp>	 I don't remember seeing this alert before from librenms-wmf...could it be this is somewhat common over time (pun) but we were not noticing it?
[15:44:37] <bblack>	 it's relatively-new, but been ... a month or two?
[15:45:13] <bblack>	 anyways, there's no obvious source network that's doing a huge percentage of requests on its own, all the traffic looks fairly-reasonably distributed, which is odd.
[15:45:29] <XioNoX>	 I see a big spike ~2Gbps toward VADATA-DC2-IX-01 ASN: 16509, google doesn't seem to show a spike (yet at least)
[15:45:36] <chasemp>	 ack, a month or two is long enough for it to be anomalous for sure
[15:45:39] <bblack>	 but it may be this is a problem that's hiding in request-counts, because they're very few but very very large outbound requests
[15:46:18] <bblack>	 maybe someone ripping down a bunch of huge tiff/djvu from one or more 1Gbps+ hosts that happen to be located right by us in VA?
[15:46:21] <mark>	 wasn't there a varnish tool that showed the top objects in terms of object size?
[15:46:24] <mark>	 (too late now I guess)
[15:46:37] <bblack>	 not that correlates well for this problem
[15:47:15] <bblack>	 16509 == EC2?
[15:47:42] <wikibugs>	 (03CR) 10PleaseStand: "Is this change actually "part II" instead of "part I"? I would think the use of the variable (in CommonSettings.php) should be removed fir" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441518 (owner: 10Jforrester)
[15:48:00] <bblack>	 so yeah, maybe an individual user doing something heavy from an EC2 instance at low latency from eqiad, not realizing the impact
[15:48:02] <XioNoX>	 indeed, AWS
[15:48:11] <bblack>	 (or from more than one)
[15:48:46] <bblack>	 we have req ratelimits we impose on everyone by default to ensure some sanity-cap, but not byte ratelimits
[15:50:12] <XioNoX>	 is a byte ratelimit possible?
[15:50:29] <XioNoX>	 aka, what to do if this happens again?
[15:51:24] <bblack>	 I don't know, it's not something we have an easy switch for anyways, would require some engineering and thinking
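Editorial aside: one common way to express a byte ratelimit is a token bucket whose tokens are bytes. The sketch below is a generic illustration under that assumption; the class and parameter names are hypothetical, and it does not describe anything deployed in Varnish or elsewhere in this infrastructure.

```python
class ByteBucket:
    """Token bucket where each token is one byte of allowed response data."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s      # refill speed
        self.capacity = burst_bytes       # maximum stored allowance
        self.tokens = burst_bytes         # start with a full bucket
        self.last = 0.0                   # timestamp of the last check

    def allow(self, nbytes: float, now: float) -> bool:
        """Return True and spend tokens if nbytes fits in the allowance."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Example: 100 bytes/s sustained, bursts of up to 200 bytes.
bucket = ByteBucket(rate_bytes_per_s=100, burst_bytes=200)
first = bucket.allow(150, now=0.0)    # fits in the initial burst
second = bucket.allow(100, now=0.0)   # only 50 bytes left, refused
third = bucket.allow(100, now=1.0)    # one second of refill makes room
print(first, second, third)
```

Passing the clock in as `now` keeps the sketch deterministic; a real limiter would also need a key per client or network and a policy for what to do with refused responses.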
[15:53:32] <mark>	 peering with ec2 ;)
[15:53:36] <mark>	 private peering I mean
[15:54:45] <XioNoX>	 sounds like a heavy option :)
[15:55:09] <mark>	 also not a quick one :)
[15:56:40] <chasemp>	 being able to easily parse out top requesters by cache misses would be interesting
[15:57:46] <mark>	 yeah although for this particular case it wouldn't matter much if they were cache hits or not
[15:58:05] <mark>	 there's varnishtop
[15:58:16] <mark>	 but that's based on #requests, not #bytes I think
[15:58:27] <mark>	 (maybe it can do that too nowadays, i haven't used it in a long while)
[15:58:42] <mark>	 and analytics may have interesting tools for it nowadays too
[16:00:53] <XioNoX>	 Equinix graph settled, all the spike is toward ASW
[16:00:55] <XioNoX>	 ASW
[16:00:59] <XioNoX>	 er, AWS
[16:06:53] <wikibugs>	 (03PS1) 10Reedy: Remove duplicate phpunit entry from composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443108
[16:07:09] <Nemo_bis>	 3 GiB/s for 25 min, so 4500 GiB?
[16:07:28] <Nemo_bis>	 Could be just the download of a relatively small collection of videos from Commons
[16:07:33] <wikibugs>	 (03CR) 10BryanDavis: [C: 031] Remove duplicate phpunit entry from composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443108 (owner: 10Reedy)
[16:08:07] <bblack>	 usually it's hard, at least with a single host/thread, for one user to suck so much bandwidth from us though.  it's easier if they're very low-latency to us, as EC2 us-east-1 likely is :)
[16:10:21] <jynus>	 Nemo_bis: 3 GBytes
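Editorial aside: the volume estimate above is a rate multiplied by a duration, and the result depends heavily on whether the rate is in bytes or bits. A small sketch of both readings, using the 3 per second and 25 minute figures quoted in the channel (function names are illustrative):

```python
def volume_from_bytes_rate(rate_gib_per_s: float, minutes: float) -> float:
    """Total GiB transferred at a rate given in GiB per second."""
    return rate_gib_per_s * minutes * 60


def volume_from_bits_rate(rate_gbit_per_s: float, minutes: float) -> float:
    """Total GB transferred at a rate given in Gbit per second (8 bits per byte)."""
    return rate_gbit_per_s * minutes * 60 / 8


total_if_bytes = volume_from_bytes_rate(3, 25)
total_if_bits = volume_from_bits_rate(3, 25)
print(total_if_bytes, "GiB if the rate is in bytes")
print(total_if_bits, "GB if the rate is in bits")
```

The two readings differ by a factor of eight, which is why pinning down the unit matters before estimating what was downloaded.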
[16:15:20] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109
[16:15:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 (owner: 10Andrew Bogott)
[16:17:39] <wikibugs>	 (03PS2) 10Andrew Bogott: Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109
[16:18:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 (owner: 10Andrew Bogott)
[16:19:24] <wikibugs>	 (03PS3) 10Andrew Bogott: Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109
[16:36:30] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Openstack: add region-migrate script and config [puppet] - 10https://gerrit.wikimedia.org/r/443109 (owner: 10Andrew Bogott)
[16:39:43] <wikibugs>	 (03PS1) 10Andrew Bogott: nova region-migrate: fix erb mistake [puppet] - 10https://gerrit.wikimedia.org/r/443112
[16:40:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] nova region-migrate: fix erb mistake [puppet] - 10https://gerrit.wikimedia.org/r/443112 (owner: 10Andrew Bogott)
[16:43:22] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841 (10herron) For wikitech-l the DMARC moderation policy is accept, which is known to cause issues with ISPs using stri...
[16:52:51] <wikibugs>	 (03PS1) 10Eevans: restbase: cleanup remainging detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659)
[16:58:25] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion)
[16:59:20] <brion>	 looks like a big outage with comcast, so not sure if anything to be done
[17:00:03] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10ayounsi) a:03ayounsi
[17:04:14] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10ayounsi) 1st look seem to indicate an issue between Telia and Comcast or within Comcast.  ``` ayounsi@bast1002:~$ mtr 73.37.60.183 -z  --report-wide Start: Fri Jun 29...
[17:09:15] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) ``` $ sudo mtr bast1002.wikimedia.org -z  --report-wide Password: Start: 2018-06-29T10:07:32-0700 HOST: Orac.local                                             L...
[17:11:59] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) and in ipv4:  ``` $ sudo mtr bast1002.wikimedia.org -z  --report-wide -4 Start: 2018-06-29T10:10:31-0700 HOST: Orac.local...
[17:17:56] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10ayounsi) Telia's NOC contacted.
[17:21:24] <XioNoX>	 !log deactivating v6 BGP session to Telia on cr2-eqiad - T198502
[17:21:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:28] <stashbot>	 T198502: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502
[17:23:49] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) Now getting  ``` $ sudo mtr bast1002.wikimedia.org -z  --report-wide Password: Start: 2018-06-29T10:22:50-0700 HOST: Orac.local...
[17:28:19] <XioNoX>	 !log Re-activating v6 BGP session to Telia on cr2-eqiad - T198502
[17:28:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:23] <stashbot>	 T198502: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502
[17:32:08] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10bearND) I'm affected, too. Comcast in CO.
[17:32:13] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10ayounsi) Traffic takes another path, GTT to us, HE back, but still no luck, so the issue seems to be within Comcast.  Looking at some Netops IRC channels, there seem t...
[17:36:04] <Niharika>	 James_F: o/ Is there a task for broken i18n on testwiki? I can't find one. 
[17:37:50] <James_F>	 Niharika: No, just follow-up from breakage two weeks ago. :-(
[17:38:16] <Niharika>	 Okay. Running full scap yesterday did not rebuild pagetriage messages, unfortunately. 
[17:39:09] <James_F>	 Yeah, I think there's something more seriously wrong with the wiki.
[17:39:51] <Reedy>	 Weren't there some cache refactoring issues?
[17:40:00] <Reedy>	 either MCR or something Aaron was working on
[17:40:05] <Reedy>	 And I thought there was a patch to fix that
[17:40:22] <James_F>	 Yeah, I pinged Aaron but he said he didn't think it was him.
[17:40:29] * James_F digs.
[17:40:51] <James_F>	 https://phabricator.wikimedia.org/T197450
[17:41:24] <Niharika>	 test2 also has missing messages in Preferences > Gadgets. 
[17:59:16] <wikibugs>	 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450 (10Reedy) p:05Triage>03High
[18:05:02] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10bearND) translatewiki.net was also affected for me. But both TWN and Gerrit are back for me now.
[18:10:31] <wikibugs>	 10Operations, 10Traffic, 10netops: Can't reach eqiad or esams from Comcast in Portland, Oregon - https://phabricator.wikimedia.org/T198502 (10brion) 05Open>03Resolved Seems to have cleared up for me too now. Marking resolved. \o/
[18:11:27] <wikibugs>	 10Operations, 10Analytics, 10Traffic: Size of headers processed by varnish? - https://phabricator.wikimedia.org/T198152 (10Nuria) Ya, 8k seems quite  a bit, not sure why would we need more than that in either end.
[18:17:48] <wikibugs>	 10Operations, 10Research, 10Research-collaborations, 10Research-management, and 2 others: Remove shell access for ironholds on 2018-06-29 - https://phabricator.wikimedia.org/T197895 (10herron) a:03herron
[18:18:03] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211 (10herron) a:03herron
[18:18:56] <wikibugs>	 (03PS2) 10Herron: adding amire80 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/442143 (https://phabricator.wikimedia.org/T198211) (owner: 10RobH)
[18:21:57] <wikibugs>	 (03CR) 10Herron: [C: 032] adding amire80 to researchers [puppet] - 10https://gerrit.wikimedia.org/r/442143 (https://phabricator.wikimedia.org/T198211) (owner: 10RobH)
[18:24:03] <wikibugs>	 (03PS4) 10Herron: remove oliver's access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895) (owner: 10RobH)
[18:25:52] <wikibugs>	 (03CR) 10Herron: [C: 032] remove oliver's access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/441434 (https://phabricator.wikimedia.org/T197895) (owner: 10RobH)
[18:40:17] <wikibugs>	 (03CR) 10Smalyshev: [C: 031] Enable smater logging for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/443001 (https://phabricator.wikimedia.org/T197645) (owner: 10Smalyshev)
[18:45:17] <wikibugs>	 10Operations, 10Research, 10Research-collaborations, 10Research-management, and 2 others: Remove shell access for ironholds on 2018-06-29 - https://phabricator.wikimedia.org/T197895 (10herron) 05Open>03Resolved This has been merged and the access removal is propagating out across the fleet now.  ``` No...
[18:45:20] <wikibugs>	 10Operations, 10Ops-Access-Reviews, 10Research, 10Research-collaborations, and 3 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945 (10herron)
[18:48:36] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Give Amir access to access to researchers group - https://phabricator.wikimedia.org/T198211 (10herron) 05Open>03Resolved This has been completed  ``` stat1006:~# id amire80 uid=2076(amire80) gid=500(wikidev) groups=500(wikidev),726(statistics-users...
[18:50:50] <wikibugs>	 (03CR) 10Eevans: [C: 031] "[PC output](http://puppet-compiler.wmflabs.org/11619) shows that all this does remove some redundant IPs from ferm (it is functionally a n" [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans)
[19:03:13] <icinga-wm>	 PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): User[ironholds]
[19:06:48] <herron>	 ^ that seems related to what I recently merged — looking
[19:11:33] <wikibugs>	 (03PS1) 10Rush: cloud: labs* VLANs are renamed in the switches [dns] - 10https://gerrit.wikimedia.org/r/443135
[19:11:51] <herron>	 !log stat1005:~# killall -u ironholds T197895
[19:11:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:57] <stashbot>	 T197895: Remove shell access for ironholds on 2018-06-29 - https://phabricator.wikimedia.org/T197895
[19:13:23] <icinga-wm>	 RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:18:40] <wikibugs_>	 (03PS2) 10Eevans: restbase: cleanup remainging detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659)
[19:18:41] <wikibugs_>	 (03PS1) 10Eevans: WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659)
[19:19:27] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans)
[19:19:37] <wikibugs_>	 (03PS3) 10Eevans: restbase: cleanup remainging detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659)
[19:19:39] <wikibugs_>	 (03PS2) 10Eevans: WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659)
[19:20:18] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans)
[19:20:50] <wikibugs_>	 (03PS4) 10Eevans: restbase: cleanup remaining detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659)
[19:20:52] <wikibugs_>	 (03PS3) 10Eevans: WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659)
[19:21:23] <wikibugs_>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans)
[19:23:33] <wikibugs>	 (03PS4) 10Eevans: WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659)
[19:24:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: restbase: use lower threshold when monitoring instance-data partition [puppet] - 10https://gerrit.wikimedia.org/r/443137 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans)
[19:32:38] <wikibugs>	 (03CR) 10Bmansurov: [C: 04-1] "to be deployed on 7/5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441568 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov)
[19:56:18] <wikibugs>	 (03PS1) 10Andrew Bogott: nova region-migrate: move region-migrate.conf into the envscripts class [puppet] - 10https://gerrit.wikimedia.org/r/443139
[20:33:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] nova region-migrate: move region-migrate.conf into the envscripts class [puppet] - 10https://gerrit.wikimedia.org/r/443139 (owner: 10Andrew Bogott)
[20:36:20] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042 (10Smalyshev) Do we have anything left to do here?
[20:39:26] <wikibugs>	 (03PS1) 10Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504)
[20:44:09] <wikibugs>	 (03PS1) 10Andrew Bogott: region-migrate.conf: remove some injurious quote marks [puppet] - 10https://gerrit.wikimedia.org/r/443191
[20:47:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] region-migrate.conf: remove some injurious quote marks [puppet] - 10https://gerrit.wikimedia.org/r/443191 (owner: 10Andrew Bogott)
[21:09:21] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10cwdent)
[21:20:49] <wikibugs>	 (03PS1) 10Rush: openstack: eqiad1 initial neutron l3-agent manifest [puppet] - 10https://gerrit.wikimedia.org/r/443197 (https://phabricator.wikimedia.org/T196633)
[21:23:14] <wikibugs>	 (03PS2) 10Rush: openstack: eqiad1 initial neutron l3-agent manifest [puppet] - 10https://gerrit.wikimedia.org/r/443197 (https://phabricator.wikimedia.org/T196633)
[21:27:57] <wikibugs>	 (03PS3) 10Rush: openstack: eqiad1 initial neutron l3-agent manifest [puppet] - 10https://gerrit.wikimedia.org/r/443197 (https://phabricator.wikimedia.org/T196633)
[21:32:53] <wikibugs>	 (03CR) 10Rush: [C: 032] openstack: eqiad1 initial neutron l3-agent manifest [puppet] - 10https://gerrit.wikimedia.org/r/443197 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush)
[21:37:57] <chasemp>	 !log (late log) of VLAN creations for T184209 cloud-instances2-b-eqiad and cloud-instance-transport1-b-eqiad
[21:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:26] <thcipriani>	 MatmaRex: backport just merged, got a few minutes to deploy this with me?
[21:40:15] <MatmaRex>	 thcipriani: yeah. i've just figured out how to allow myself to rollback changes on test.wiki
[21:41:15] <thcipriani>	 awesome, ok, it's live on mwdebug1002
[21:41:45] <thcipriani>	 thank you for your help
[21:43:21] <MatmaRex>	 thcipriani: i'm testing, the page is taking ages to load. i've had that happen before with mwdebug1002, dunno why
[21:43:56] <MatmaRex>	 ugh, it crapped out, actually. [Wzan7wpAAC4AAGvGsZAAAAAK] 2018-06-29 21:43:28: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"
[21:44:22] <MatmaRex>	 but it rollbacked correctly: https://test.wikipedia.org/w/index.php?title=T198449&action=history
[21:44:23] <stashbot>	 T198449: Rollback from autopatrolled user was marked as not patrolled - https://phabricator.wikimedia.org/T198449
[21:44:31] <MatmaRex>	 and the change is patrolled on https://test.wikipedia.org/wiki/Special:RecentChanges
[21:45:04] <MatmaRex>	 thcipriani: can you look up the exception for Wzan7wpAAC4AAGvGsZAAAAAK? i don't see it in logstash
[21:45:28] <MatmaRex>	 i just got that when making that rollback (which succeeded, otherwise)
[21:46:40] * thcipriani looking
[21:47:14] <thcipriani>	 > Error: 1205 Lock wait timeout exceeded;
[21:47:30] <MatmaRex>	 (which is another concerning trend. that's the third time in two days where i can't lookup an exception in logstash)
[21:48:45] <thcipriani>	 https://phabricator.wikimedia.org/P7318
[21:52:27] <MatmaRex>	 thcipriani: i have no idea if we should proceed. it might be an issue specifically with mwdebug1002; the last time i was testing something on it this week, it took a super long time to give me a response, just like this time
[21:52:59] <MatmaRex>	 (except this time it exceeded a lock wait timeout because of it)
[21:54:01] <thcipriani>	 MatmaRex: me either re:should proceed. In light of the fact that it's Friday, I'll rollback and comment on the task. thank you for all your help! If it were a cleaner path forward could have saved some folks some headache over the weekend :(
[21:54:27] <greg-g>	 yeah, I'm not comfortable right now.
[21:55:38] <MatmaRex>	 yeah, i understand.
[21:55:41] <MatmaRex>	 this is dumb, eh
[21:55:48] <MatmaRex>	 logstash has some timeouts and slow queries logged when i search for "WikiPage::lockAndGetLatest", by the way
[22:23:47] <wikibugs>	 (03CR) 10BryanDavis: [C: 04-1] "I'm not sure that the order of checks here is quite right. The tool could have a ~/.webservicerc file without having set a --backend value" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha)
[22:40:49] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10ayounsi) I don't either.  Note that 8.155.80.208.in-addr.arpa domain name pointer frbast1001.wikimedia.org.
[22:44:19] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042 (10Gehel) I think we're all done here.
[22:44:55] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Investigate HTTP 500 on POST request to WDQS - https://phabricator.wikimedia.org/T198055 (10Gehel) Looks good, we can close this.
[22:44:59] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: WDQS timeout on the public eqiad cluster - https://phabricator.wikimedia.org/T198042 (10Gehel)
[22:45:02] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Investigate HTTP 500 on POST request to WDQS - https://phabricator.wikimedia.org/T198055 (10Gehel) 05Open>03Resolved a:03Gehel
[22:58:43] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: NAT and DNS for fundraising monitor host - https://phabricator.wikimedia.org/T198516 (10cwdent) @ayounsi ah yes thanks, I forgot to update the documentation for that, but just did